Abstract: Estimating the mode or modal-sets (i.e. extrema points or surfaces) of an unknown density from sample is a basic problem in data analysis. Such estimation is relevant to other problems such as clustering, outlier detection, or can simply serve to identify low-dimensional structures in high dimensional-data (e.g. point-cloud data from medical-imaging, astronomy, etc). Theoretical work on mode-estimation has largely concentrated on understanding its statistical difficulty, while less attention has been given to implementable procedures. Thus, theoretical estimators, which are often statistically optimal, are for the most part hard to implement. Furthermore for more general modal-sets (general extrema of any dimension and shape) much less is known, although various existing procedures (e.g. for manifold-denoising or density-ridge estimation) have similar practical aim. I’ll present two related contributions of independent interest: (1) practical estimators of modal-sets – based on particular subgraphs of a k-NN graph – which attain minimax-optimal rates under surprisingly general distributional conditions; (2) high-probability finite sample rates for k-NN density estimation which is at the heart of our analysis. Finally, I’ll discuss recent successful work towards the deployment of these modal-sets estimators for various clustering applications.
Much of the talk is based on a series of work with collaborators S. Dasgupta, K. Chaudhuri, U. von Luxburg, and Heinrich Jiang.
Speaker Biography: Samory Kpotufe graduated with a PhD, in Sept 2010, from Computer Science at the University of California, San Diego; his advisor was Dr. Sanjoy Dasgupta. Dr. Kpotufe then joined Max Planck Institute for Intelligent Systems as a research in the department of Bernhard Schoelkopf, in the learning theory group of Ulrike von Luxburg. Following his work at MPI, Dr. Kpotufe was an Assistant Research Professor at the Toyota Technological Institute at Chicago. Prof. Kpotufe is now the Assistant Professor of Operations Research and Financial Engineering at Princeton University.
Speaker's research interests: I work in machine learning, with an emphasis on nonparametric methods and high-dimensional statistics. Generally, I’m interested in understanding the inherent difficulty of high-dimensional problems, under practical constraints from real-world application domains. The nonparametric setting is attractive in that it captures scenarios where we have little domain knowledge, which is important as data sciences reach into a diverse range of applications.
My main practical aim is to design adaptive procedures, i.e., practical procedures that can self-tune to unknown structure in data (e.g., manifold, sparsity, clusters), while at the same time meeting the various constraints (e.g., time, space, labeling cost) of modern applications.
For more, here is a recent research statement.