4.2 Statistical analysis of bio-molecular data and combinatorial difficulties: two examples (Stéphane Robin)

StatLearn 2010 - Workshop on "Challenging problems in Statistical Learning"


Combinatorial issues are often raised by statistical model inference and selection, in particular when dealing with high-dimensional data. In such cases, asymptotic approximations or Monte Carlo methods are often used to approximate the quantities of interest. In this talk, we will present two examples dealing with bio-molecular data. In both of them, exact results can be obtained thanks to specific combinatorial and algorithmic developments. We will first consider the typical multiple testing issue that is faced when dealing with high-throughput data. In this framework, most multiple testing procedures require a precise estimate of the proportion of true null hypotheses. This estimation problem can be rephrased as a histogram selection problem, which can be solved via leave-p-out (LpO) cross-validation. We will present explicit results that allow us to solve this model selection problem while avoiding the computational burden inherent to LpO. We will then consider a segmentation problem encountered when looking for chromosomal aberrations based on microarray data. The detection of breakpoints and the estimation of their number is an old statistical problem. As for the precision of their localisation, only asymptotic results are available. We will present a dynamic-programming-type algorithm that allows us to explore the whole segmentation space. It provides information on the localisation precision. It furthermore provides a new model selection criterion for the number of breakpoints.
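To give a flavour of how dynamic programming makes the segmentation space tractable, here is a minimal sketch of the classical least-squares segmentation recursion. It is not the algorithm presented in the talk, which explores the whole segmentation space to quantify localisation precision. It only finds the single best segmentation for a fixed number of segments, assuming a piecewise-constant mean. The function name `best_segmentation` and the toy signal are hypothetical, chosen for illustration.

```python
def best_segmentation(y, n_segments):
    """Least-squares segmentation of y into n_segments contiguous pieces.

    Illustrative assumption: each segment is fitted by its own mean, and the
    cost of a segment is its residual sum of squares. Returns the list of
    breakpoints (start index of every segment except the first) and the
    total cost of the best segmentation.
    """
    n = len(y)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(y)")

    # Prefix sums give the cost of any segment y[i:j] in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for t, value in enumerate(y):
        s1[t + 1] = s1[t] + value
        s2[t + 1] = s2[t] + value * value

    def cost(i, j):
        # Residual sum of squares of y[i:j] around its mean (requires j > i).
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of cutting y[:j] into exactly k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    # last_start[k][j]: start index of the last segment in that optimum.
    last_start = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0

    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    last_start[k][j] = i

    # Backtrack from the full signal to recover the segment starts.
    starts = []
    j = n
    for k in range(n_segments, 0, -1):
        j = last_start[k][j]
        starts.append(j)
    starts.reverse()
    return starts[1:], best[n_segments][n]


# Toy signal with three plateaus.
signal = [0, 0, 0, 5, 5, 5, 1, 1]
breakpoints, total_cost = best_segmentation(signal, 3)
print(breakpoints, total_cost)
```

The recursion visits on the order of K·n² (segment count times squared signal length) candidate cuts instead of enumerating every possible placement of the breakpoints, which is what makes exact rather than asymptotic or Monte Carlo results feasible in this setting.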