Cornell University

Current Research Directions

There are a number of areas in this problem that need further investigation. We are currently working on some of these, but others are virtually untouched.

Statistical inference. The "conditional MLE" (conditional on the number of observed species), which is our principal point estimator, has been examined from a number of angles and is known to perform well (if the model is a reasonable approximation to reality). It is strongly consistent; it is an empirical modification of an unbiased estimator; it is the Horvitz-Thompson estimator in this case, etc. Hence we believe its performance to be well-understood and reliable (up to, as mentioned above, the suitability of the parametric model). However, variance estimation is less firmly established. We currently have three formulas for standard errors for the parametric maximum likelihood-based estimate of the number of classes: a lower bound for the standard error, and two slightly different formulas based on different approaches to asymptotic (large-sample) theory. The lower bound is straightforward to compute, but given its value the user does not know how much greater the true standard error may be. The other formulas involve significant computation.

There are two main strategies for variance estimation: direct estimates based on asymptotic formulae; and resampling, specifically the bootstrap. (This holds for both the parametric and the coverage-based nonparametric methods; the nonparametric maximum likelihood methods appear to use only resampling. No exact small-sample variance results are known for any version of this problem.) Sanathanan (1977) provided an asymptotic variance formula for the conditional and unconditional MLE's. This is the basis of our standard errors. However, the precision of this formula is unknown for small samples, and remains to be investigated, whether analytically or by simulation. In some cases one obtains a standard error such that the normal-theory 95% confidence interval (point estimate) +/- 1.96*SE has a lower endpoint below the necessary lower bound, which is the number of actually observed species or classes. This may indicate imprecision in the variance estimate; it certainly indicates inappropriateness of the standard normal-theory confidence interval (see below).

As an alternative, variance estimation by the bootstrap has been carried out by a number of authors in this problem (Bohning and Schon, 2005). Certain modifications to the bootstrap are required by the structure of this problem, and these have been applied with increasing sophistication in recent literature. However, the bootstrap may run afoul of the same problem noted above: the apparent range of variation of the estimate may actually hit the necessary lower bound (the observed number of classes). It is not presently clear whether this fact means that (a) bootstrap variance estimation is incorrect in this problem; (b) it is correct but only after careful modifications of the procedure; or (c) it is correct as is but requires a different interpretation. This is an open problem in mathematical statistics; we have not yet undertaken this study but our preliminary reading of the theoretical literature indicates that the problem is subtle and that the answer is not clear a priori.

We know that our estimator has an asymptotic normal distribution, so that for large samples the normal-theory confidence intervals are appropriate. However, the situation for small samples is less clear (even less than for variance estimation). At present mathematical statistics does not have an answer for this problem. Some solutions have been proposed. For example, the coverage-based nonparametric methods suffer from this same problem, so Chao (1987) proposed a transformed interval that avoids the lower bound of the observed number of species and could be adapted for use here; however, the properties of this interval have not been assessed theoretically (and again it relies on asymptotics). One could also compute direct resampling (bootstrap) -based intervals, but here again we are not certain of the application of the bootstrap in this problem. Another possibility is to use profile confidence intervals; but again the requisite theoretical development has not yet been carried out, and such intervals may also be affected by the lower bound problem out (although recent work of such as that of Lang may be of value here (Lang, J. (2005), Profile confidence intervals for contingency table parameters, Univ. of Iowa Tech. Report #351)). Thus we do not presently have a procedure that can guarantee the nominal coverage (say, 95%) in small samples.

The underlying issue is that in this problem, the operative parameter space (the range of possible values of the estimand, i.e., the number of species) is bounded by the random observables (in particular, the observed number of species). This fact violates a regularity condition that is required for a substantial proportion of classical statistical results, and consequently analysis of this problem lies outside of the scope of classical theory. It may be that only slight modifications to that theory are needed, or it may be that significantly deep analysis will be necessary.
Models. We seek to expand the scope of parametric models, i.e., to find new classes of mixing distributions. "New" here means new to this problem: in the past only a few mixing distributions have been considered in this context (exponential, gamma, inverse Gaussian, lognormal). The main goal is to find a class of mixing distributions that would (ideally) make it unnecessary to truncate the data at some upper frequency in order to obtain a good fit. Some possibilities include (but are not limited to):

The generalized inverse Gaussian family, first proposed in this application in by H.S. Sichel (see the Bibliography This is a three-parameter family (the inverse Gaussian is a two-parameter special case). While promising, the family is mathematically complicated and has proven computationally intractable to date.

The general class of mixed-exponential distributions. We can currently fit low-dimensional finite mixtures of exponentials, but the general family has not been explored in this problem. There is some recent research on the family, especially on computing, that may be helpful.

The "generalized gamma convolutions" set forth by Bondesson in his book of that title. There is some overlap with other families (and this group includes the gamma and exponential), but the group is relatively unexplored.

Heavy-tailed distributions such as the log-stable or the log-Cauchy. (The lognormal is a special case.) We are experimenting with a location/scale version of the log-Cauchy but we do not yet have a usable model-fitting algorithm for it.

Extensions of the Pareto, which seems to work well for high-diversity data. We are not sure at this point how to embed the Pareto in a larger family suitable for this application.

In fact, in a recent paper Karlis and Xekalaki (2005) gave an extended list of potential Poisson mixture models (along with considerable analysis and references to the literature), including the following possible mixing distributions in addition to those mentioned above; as far as we know, none of these has been applied in the species problem.

linear exponential family

Lindley

confluent hypergeometric series

inverse gamma

truncated normal

gamma product ratio

exponential ^ beta

generalized Pareto

beta types I & II

uniform

truncated gamma

generalized gamma

modified Bessel of the 3rd kind

Pearson's family

lomax

power variance family

other discrete distributions

Model selection. Currently we analyze the performance of all parametric models on all possible right-truncated subsets of the data, and choose the best performer on the basis of (i) two goodness-of-fit tests, (ii) ability to produce a reasonably small standard error, (iii) use of the maximum amount of the empirical frequency data (largest right truncation point), and (iv) various heuristic considerations. The concepts of model selection set forth in (for example) Burnham and Anderson (1998) (Model selection and inference: A practical information-theoretic approach, New York: Springer-Verlag) provide a different perspective on the competing models. Indeed our original idea was to rank the models according to one or more information-theoretic statistics such as the AIC (Akaike information criterion), and to use this to select our final analysis. We did attempt this, but we decided that, at this stage of our knowledge, the problem of model selection in this problem is still too multidimensional to permit such a reduction - in short, expert judgment still plays a role.

The information-theoretic criteria do take the number of model parameters into account. However, the numbers of parameters in our current models are: 1, for the (unmixed) Poisson, which assumes equal species abundances and hence never fits real data; 2, for the negative binomial, inverse Gaussian, lognormal and Pareto; and 3 for the 2-mixed exponential. Thus for realistic models we have 2 parameters in four cases and 3 in one case, so the question of different numbers of parameters does not play as large a role as it does in, say, a high-dimensional regression problem. (Once we implement more complicated finite mixtures, such as a 3-mixed exponential or a mixture of 2 lognormals, the number of parameters will rise significantly, and this issue will become important.)

Because the different models permit different right truncation points, the effective datasets (in particular the sample sizes) are not the same for all models. Such differences can be accommodated to some extent by the information-theoretic criteria, but we do not have enough empirical experience with this issue in this problem to apply these criteria with confidence under these circumstances. (The coverage-based nonparametric methods get around this by simply taking the right truncation point to be 10 by default.) Our ultimate goal, of course, is to find a family of parametric models (which will probably turn out to be finite mixtures of known distributions) that will fit complete empirical frequency datasets, in which case the problem of the right truncation point will disappear.

We are not fully satisfied with likelihood, the Pearson chi-square, or other goodness-of-fit statistics at this point, partly because in this problem we may wish to differentially weight the frequency curve, seeking better fit toward the left (rare species) and less stringent fit toward the right. At present we do not have a method for this. In short, at present we prefer to compare simple and well-understood statistics, and to take data-analytic and heuristic considerations into account, in selecting a final analysis. However, we envision incorporating the information-theoretic criteria into a more refined model-choice procedure, once the issues above (and others) have been resolved.
Computing. We have implemented a new numeric algorithm to compute the maximum likelihood estimates (and associated statistics) for the parametric models. This is an accelerated version of the EM (expectation-maximization) algorithm. In fact the EM algorithm for this problem was applied more or less simultaneously by us, in the parametric version, and by Mao, Bohning and others in the nonparametric maximum likelihood version. The Bohning/Schon (2005) paper has an analysis of the convergence of this algorithm. Our acceleration appears to be a version of an acceleration due to Aitkin which has never been applied in this problem (Bohning, personal communication). This algorithm allows us to obtain precise estimates within a given model, and estimates that are comparable across models from a numerical point of view. However, our current code is not fast, or not as fast as it could be, and would admit a number of incremental improvements. Larger-scale improvements would entail moving to a different computing platform (software and/or hardware); incorporating recursions for complicated formulas where possible, improved convergence-checking, etc. We also note that some of the parametric models present specific computational difficulties beyond the general accelerated EM algorithm mentioned above. In particular the lognormal and the Pareto require locally adaptive approximations to the probability mass function (the former by Riemann sums, the latter by another finite series representation). Such approximations can always be improved and made more efficient.
Nonparametrics. One natural extension of the parametric model-fitting approach is to allow a nonparametric mixing distribution, i.e., one that is not a member of a family specified by a finite number of parameters. There is some current and promising work in this regard (due to Mao, Bohning, and others), but the computation involved is still difficult, and the interpretation of the results will need refinement. The general mixed-exponential approach mentioned above constitutes one possible "semi-parametric" compromise. Following up on this point, there is a need for reconciliation of currently known coverage-based nonparametric methods with parametric (and eventually nonparametric) maximum likelihood model-based methods. This is a potentially fruitful area of research, because it would tell us the conditions under which the various estimators are likely to be effective, and how they relate to the various parametric models. Apart from some simulation comparisons, however, the only work in this direction (as far as we know) is the 2002 paper of Chao & Bunge.
Fully Bayesian methods. In a fully Bayesian approach one assigns a prior distribution (informative or noninformative) to the number of classes, and possibly "hyper-priors" to the parameters of the mixing distribution (or to a family of mixing distributions) as well. There is by now a considerable literature on these approaches, but they have not been unified or (generally) compared to the frequentist methods (parametric or nonparametric) discussed above. This is an excellent area for research but is beyond the scope of our current projects.
Several samples. In some situations we may have a number of distinct samples from the same population, and it may not be scientifically logical to simply pool them and regard the result as a single sample. Is there some way to exploit the division between the samples? While some of the extensive work in multiple-recapture theory may bear upon this problem, it is not clear at present how to use a mixture model (that is, a probability distribution for the sampling intensities of the various species) in this context. This is an open, and pressing, topic for research.

Estimating the Number of Classes in a Population

How many goodly creatures are there here! - The Tempest, V, i

Current Research Directions