About Me

I am currently working as a software engineer at Google Inc. In the summer of 2015 I obtained my Ph.D. from the School of Operations Research and Information Engineering at Cornell University. During my time at Cornell my Ph.D. advisor was David S. Matteson.

I received my B.S. in Mathematics, with a minor in Computer Science, from the University of Florida.

Research

My research interests include time series, nonparametric statistics, and machine learning. My current research is focused on nonparametric methodologies for performing change point analysis of multivariate data. Change point analysis, which pertains to detecting distributional changes in time-ordered observations, has applications in a variety of fields, including economics, finance, genetics, and medical diagnostics.

Publications and Submitted Papers

  • A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data
    Journal of the American Statistical Association, Vol. 109, No. 505: 334-345. Joint work with David S. Matteson. Manuscript (pdf)

    Change point analysis has applications in a wide variety of fields. The general problem concerns the inference of a change in distribution for a set of time-ordered observations. Sequential detection is an online version in which new data is continually arriving and is analyzed adaptively. We are concerned with the related, but distinct, offline version, in which retrospective analysis of an entire sequence is performed. For a set of multivariate observations of arbitrary dimension, we consider nonparametric estimation of both the number of change points and the positions at which they occur. We do not make any assumptions regarding the nature of the change in distribution or any distribution assumptions beyond the existence of the αth absolute moment, for some α ∈ (0,2). Estimation is based on hierarchical clustering and we propose both divisive and agglomerative algorithms. The divisive method is shown to provide consistent estimates of both the number and location of change points under standard regularity assumptions. We compare the proposed approach with competing methods in a simulation study. Methods from cluster analysis are applied to assess performance and to allow simple comparisons of location estimates, even when the estimated number differs. We conclude with applications in genetics, finance and spatio-temporal analysis.
  • Locally Stationary Vector Processes and Adaptive Multivariate Modeling
    Acoustics, Speech and Signal Processing, IEEE, 8722 - 8726. Joint work with David S. Matteson, William B. Nicholson, and Louis C. Segalini. Manuscript (pdf)

    The assumption of strict stationarity is often too strong for observations in many time series applications; however, distributional properties may be at least locally stable in time. We define multivariate measures of homogeneity to quantify local stationarity and an empirical approach for robustly estimating time varying windows of stationarity. Finally, we consider a bivariate series that is believed to be cointegrated locally, assess our estimates, and discuss applications in financial asset pairs trading.
  • ecp: An R Package for Nonparametric Change Point Analysis of Multivariate Data
    Journal of Statistical Software, Vol. 62, No. 7: 1-25. Joint work with David S. Matteson. Manuscript (pdf); R package (ecp)

    There are many different ways in which change point analysis can be performed, from purely parametric methods to those that are distribution free. The ecp package is designed to perform multiple change point analysis while making as few assumptions as possible. While many other change point methods are applicable only for univariate data, this R package is suitable for both univariate and multivariate observations. Estimation can be based upon either a hierarchical divisive or agglomerative algorithm. Divisive estimation sequentially identifies change points via a bisection algorithm. The agglomerative algorithm estimates change point locations by determining an optimal segmentation. Both approaches are able to detect any type of distributional change within the data. This provides an advantage over many existing change point algorithms which are only able to detect changes within the marginal distributions.
  • Leveraging Cloud Data to Mitigate User Experience from 'Breaking Bad'
    arXiv:1411.7955. Joint work with Arun Kejariwal and David S. Matteson. Manuscript (pdf)

    Low latency and high availability of an app or a web service are key, among other factors, to the overall user experience (which in turn directly impacts the bottom line). Exogenic and/or endogenic factors often give rise to breakouts in cloud data, which makes maintaining high availability and delivering high performance very challenging. Although there exists a large body of prior research in breakout detection, existing techniques are not suitable for detecting breakouts in cloud data because they are not robust in the presence of anomalies.
    To this end, we developed a novel statistical technique to automatically detect breakouts in cloud data. In particular, the technique employs Energy Statistics to detect breakouts in both application as well as system metrics. Further, the technique uses robust statistical metrics, viz., median, and estimates the statistical significance of a breakout through a permutation test. To the best of our knowledge, this is the first work which addresses breakout detection in the presence of anomalies.
    We demonstrate the efficacy of the proposed technique using production data and report Precision, Recall, and F-measure. The proposed technique is 3.5 times faster than a state-of-the-art technique for breakout detection and is currently used on a daily basis at Twitter.
  • Change Points via Probabilistically Pruned Objectives
    Submitted. Joint work with David S. Matteson. Manuscript (pdf)

    The concept of homogeneity plays a critical role in statistics, both in its applications and in its theory. Change point analysis is a statistical tool that aims to attain homogeneity within time series data. This is accomplished through partitioning the time series into a number of contiguous homogeneous segments. The applications of such techniques range from identifying chromosome alterations to solar flare detection. In this manuscript we present a general purpose search algorithm called cp3o that can be used to identify change points in multivariate time series. This new search procedure can be applied with a large class of goodness of fit measures. Additionally, a reduction in the computational time needed to identify change points is accomplished by means of probabilistic pruning. With mild assumptions about the goodness of fit measure, this new search algorithm is shown to generate consistent estimates for both the number of change points and their locations, even when the number of change points increases with the time series length.
    A change point algorithm that incorporates the cp3o search algorithm and E-Statistics, e-cp3o, is also presented. The only distributional assumption that the e-cp3o procedure makes is that the absolute αth moment exists, for some α ∈ (0,2). Due to this mild restriction, the e-cp3o procedure can be applied to a majority of change point problems. Furthermore, even with such a mild restriction, the e-cp3o procedure has the ability to detect any type of distributional change within a time series. Simulation studies are used to compare the e-cp3o procedure to other parametric and nonparametric change point procedures, and we highlight applications of e-cp3o to climate and financial datasets.
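Several of the abstracts above rely on the same building block: an energy-statistic (E-Statistics) divergence between two segments, maximized over candidate split points and assessed with a permutation test. The following is only a minimal Python sketch of that idea, not the code of the ecp or edm packages. It makes several simplifying assumptions: it searches for a single change point, the right-hand segment always runs to the end of the series, the cost is naively cubic in the series length, and every function name (scaled_divergence, best_split, permutation_p_value) is hypothetical and chosen for illustration.

```python
import math
import random


def mean_within(sample, alpha):
    """Average of |a - b|^alpha over all unordered pairs within one sample."""
    n = len(sample)
    total = sum(math.dist(sample[i], sample[j]) ** alpha
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def mean_between(xs, ys, alpha):
    """Average of |x - y|^alpha over all pairs with x in xs and y in ys."""
    total = sum(math.dist(x, y) ** alpha for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def scaled_divergence(xs, ys, alpha=1.0):
    """Sample energy divergence between xs and ys, scaled by nm / (n + m)."""
    n, m = len(xs), len(ys)
    e_hat = (2 * mean_between(xs, ys, alpha)
             - mean_within(xs, alpha)
             - mean_within(ys, alpha))
    return n * m / (n + m) * e_hat


def best_split(series, min_size=5, alpha=1.0):
    """Return (tau, q): the split index that maximizes the scaled divergence."""
    best_tau, best_q = None, -math.inf
    for tau in range(min_size, len(series) - min_size + 1):
        q = scaled_divergence(series[:tau], series[tau:], alpha)
        if q > best_q:
            best_tau, best_q = tau, q
    return best_tau, best_q


def permutation_p_value(series, observed_q, n_perm=49, min_size=5,
                        alpha=1.0, seed=0):
    """Approximate p-value: how often a shuffled series does at least as well."""
    rng = random.Random(seed)
    shuffled = list(series)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, q = best_split(shuffled, min_size, alpha)
        if q >= observed_q:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic bivariate series: the mean shifts after the 15th observation.
    rng = random.Random(1)
    before = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(15)]
    after = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(15)]
    series = before + after
    tau, q = best_split(series)
    print("estimated change point:", tau)
    print("permutation p-value:", permutation_p_value(series, q))
```

A divisive procedure would apply best_split recursively to each resulting segment and stop once the permutation test is no longer significant. The exponent alpha plays the role of α ∈ (0,2) in the moment condition above.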

Works in Preparation

  • multidcov: An R Package for Independent Component Analysis and Test of Independence via Multivariate Distance Covariance
    Joint work with Benjamin B. Risk and David S. Matteson.

Software

  • ecp
    This is an R package for performing multiple change point analysis of multivariate data. The methodologies implemented in this package are those described in A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data.
  • edm
    This is an R package for performing multiple change point analysis of multivariate data in the presence of anomalies. The methodologies implemented in this package are those described in Leveraging Cloud Data to Mitigate User Experience from 'Breaking Bad'. A version of this package is also available on GitHub.

Curriculum Vitae


Contact Information

294 Rhodes Hall
Cornell University
Ithaca, NY 14853
nj89 at cornell.edu
