Approximating Dirichlet Process Mixture Models in Stan
Dirichlet process mixture models (DPMMs) are useful for inferring clusters of N observations (or J effects) when the number of clusters K is unknown. Fitting a DPMM in Stan is challenging, however, because it involves latent discrete configurations that assign the N observations or J effects to K clusters, with each configuration carrying K cluster-specific parameters. One approach is to integrate over all possible configurations, but this requires either conjugate prior distributions or sampling all parameters for all configurations. For a DPMM of J effects with a normal base distribution, we introduce an approximation to the DPMM that also integrates over all possible configurations but only samples J parameters, rather than all parameters for all configurations. This involves redefining the mixture over the covariance matrix for the J parameters. These covariance matrices are singular when K < J, impeding mixing and reducing the effective sample size in Stan. Our solution adds a small amount of independent noise to the J parameters to ensure that all covariance matrices are nonsingular. This defines an approximate DPMM in which effects that are clustered together are not identical, but instead are very close together. This approximation improves mixing and greatly simplifies the Stan model. We demonstrate our approach for a mixture of J=8 effects, with 4,140 possible configurations and 8 (rather than 17,007) parameters. Our demonstration includes an additional layer of mixture by probabilistically assigning the N observations to S=36 states, where the effect of each state is an additive function of the J=8 effects. Calculating the likelihood for this model under the DPMM requires 36 × 4,140 = 149,040 computations per observation; for the approximate DPMM, it requires only 36. Thus, the approximate DPMM is a promising avenue for fitting simplified, extensible mixture models in Stan.
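The counts quoted above, and the singular-covariance issue at the heart of the approximation, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the example partition, tau, and eps values are hypothetical, and the covariance structure Z Z' tau^2 is our reading of "redefining the mixture over the covariance matrix for the J parameters."

```python
import numpy as np
from functools import lru_cache

# --- Combinatorics behind the quoted numbers ---
# Stirling numbers of the second kind, S(n, k): the number of ways to
# partition n effects into exactly k nonempty clusters.
@lru_cache(maxsize=None)
def stirling2(n, k):
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

J, S = 8, 36
configs = sum(stirling2(J, k) for k in range(1, J + 1))      # Bell number B(8)
all_params = sum(k * stirling2(J, k) for k in range(1, J + 1))
print(configs)          # 4140 possible configurations
print(all_params)       # 17007 parameters across all configurations
print(S * configs)      # 149040 likelihood computations per observation

# --- The singular-covariance problem and the jitter fix ---
# Hypothetical example partition of the J = 8 effects into K = 3 clusters.
z = np.array([0, 0, 1, 1, 1, 2, 2, 2])
K, tau = 3, 1.0
Z = np.eye(K)[z]                       # J x K one-hot assignment matrix
Sigma = tau**2 * (Z @ Z.T)             # implied covariance of the J effects
print(np.linalg.matrix_rank(Sigma))    # 3: rank K < J, so Sigma is singular

# The proposed fix: a small independent-noise variance eps^2 on the
# diagonal makes the covariance nonsingular for every configuration,
# so clustered effects become "very close" rather than identical.
eps = 0.01
Sigma_approx = Sigma + eps**2 * np.eye(J)
print(np.all(np.linalg.eigvalsh(Sigma_approx) > 0))   # True: positive definite
```

The first half reproduces the abstract's counts exactly (the 4,140 configurations are the Bell number B(8), and 17,007 is the total number of cluster parameters summed over all configurations); the second half shows why the jitter term is needed before a multivariate normal density can be evaluated for configurations with K < J.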
I am a postdoctoral researcher in the Department of Genetics at the University of North Carolina at Chapel Hill. I completed my PhD in Bioinformatics and Computational Biology at UNC in 2019, and my undergraduate degree in Mathematics and Economics at UNC in 2011. My current research interests include statistical genetics, Bayesian nonparametric models, and Bayesian model selection, with applications to biomedical traits and -omics data.