Statistical models can help discover structure in data, such as connectivity patterns in neuronal networks. The data scientist, however, faces the possibility of model misspecification, as well as the danger of drawing invalid conclusions from evaluating too many models on the same data. We will discuss three vignettes that attempt to address some of these concerns.
First, we argue that cross-validation---a method often used for model selection---is not the optimal statistical procedure for this task. Our theoretical analysis suggests a different (and optimal) method, and we demonstrate its effectiveness in practice. Second, we discuss procedures for estimating patterns in networks or genomics data under several natural statistical models. Finally, we describe a sequential prediction paradigm in which data arrive in an online fashion, and we propose a method for choosing dynamic treatment regimes that is robust to model misspecification.
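As background for the first vignette, here is a minimal sketch of K-fold cross-validation used for model selection, the baseline procedure under discussion. The toy data, the candidate polynomial models, and the NumPy-based implementation are illustrative assumptions, not material from the talk.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))        # shuffle once, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)   # fit on k-1 folds
        pred = np.polyval(coefs, x[test])                # evaluate on held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Toy data: a linear signal plus noise, so low-degree models should be favored.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

candidate_degrees = [1, 4, 8]
cv_scores = {d: kfold_cv_mse(x, y, d) for d in candidate_degrees}
best_degree = min(cv_scores, key=cv_scores.get)
```

Model selection here means picking the candidate with the lowest cross-validated error; the talk's point is that this familiar recipe can be improved upon.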