Date & Time: Monday, August 8, 2011 :: 2:30 p.m.

Venue: Ramanujan Hall

Title: Integrated Modelling and Inference for Heterogeneous Data

Speaker: Madhuchhanda Bhattacharjee, Department of Statistics, University of Pune, India

Abstract: In this talk I will present a framework for integrating diverse data sets under a coherent probabilistic setup. The need for probabilistic modeling arises from the fact that data integration is not restricted to compiling information from databases whose contents are typically thought of as non-random. A wide range of experimental data is now also available; however, these data sets can rarely be summarized as simple output data, e.g. in categorical form, and it may not even be appropriate to do so.

Also, unlike in the framework of meta-analysis, the composite data may comprise several experiments that are meaningful but complementary in nature. The hypothesis-driven fusion of these enabled us to drastically reduce the required experiment size. Note that even if we had the resources to carry out the full experiment, some parts of it might not have been biologically feasible.

Often the biological hypotheses emerge as the analysis progresses. The framework should therefore be flexible enough to translate them, at various stages of the analysis, into the statistical framework as functions of parameters.

The different functional combinations of the model parameters cover a wide range of biological characteristics to be studied. This is another aspect of our modeling setup that is not easily available in commonly used statistical tools, simply because complex hypotheses require specialized testing procedures, which may not be readily available. The complexity of individual modeling units was kept moderate to optimize computation time and the parameter space of interest.
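
As a purely illustrative sketch of a hypothesis expressed as a function of parameters, the Python snippet below estimates the posterior probability that one mean exceeds another by a given margin. Everything in it is an assumption made for illustration and not the speaker's model: the data values, the known noise level `sigma`, the flat prior on each mean, the independence of the two experiments, and the helper names `posterior_draws` and `posterior_probability`.

```python
import random


def posterior_draws(data, sigma, n_draws, rng):
    """Draw from the posterior of a normal mean (known sigma, flat prior)."""
    n = len(data)
    mean = sum(data) / n
    sd = sigma / n ** 0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def posterior_probability(draws_a, draws_b, hypothesis):
    """Fraction of paired posterior draws for which the hypothesis holds."""
    hits = sum(1 for a, b in zip(draws_a, draws_b) if hypothesis(a, b))
    return hits / len(draws_a)


rng = random.Random(0)

# Hypothetical measurements from two complementary experiments.
expr_a = [2.1, 2.4, 1.9, 2.6, 2.2]
expr_b = [1.2, 1.5, 1.1, 1.4, 1.3]

draws_a = posterior_draws(expr_a, sigma=0.5, n_draws=5000, rng=rng)
draws_b = posterior_draws(expr_b, sigma=0.5, n_draws=5000, rng=rng)

# Hypothesis as a function of parameters: mean of A exceeds mean of B by 0.5 or more.
prob = posterior_probability(draws_a, draws_b, lambda a, b: a - b >= 0.5)
print(f"Posterior probability of the hypothesis: {prob:.3f}")
```

Because the hypothesis is just a function evaluated on posterior draws, a different or more complex hypothesis can be assessed by swapping the function, without a specialized testing procedure.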

When analyzing data with existing models, we quite often observe that even a moderate change in the analysis technique for one data set, even at a single step of the analysis, can influence the overall biological conclusions. It is well known that this happens because uncertainties in these analysis steps are not propagated to subsequent steps. The Bayesian setup enables us to avoid this problem. Unfortunately, analytic intractability is a common consequence of such complex models. We have succeeded in implementing this integrated model using available software, opening up varied modeling and input data-type possibilities.
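
The effect of not propagating uncertainty can be illustrated with a small Python sketch. It compares a two-step "plug-in" analysis, in which a baseline estimated in step one is treated as exact in step two, with a joint analysis that carries the baseline's posterior uncertainty forward. The data, the known noise level `sigma`, the flat priors, and the helper `mean_posterior_draws` are all assumptions for illustration; this is not the integrated model described in the talk.

```python
import random
import statistics

rng = random.Random(1)
sigma = 0.5  # assumed known measurement noise (hypothetical)

# Hypothetical control and treated measurements.
control = [1.0, 1.4, 0.9, 1.3]
treated = [2.0, 2.3, 1.8, 2.5]


def mean_posterior_draws(data, n_draws):
    """Posterior draws of a normal mean under a flat prior and known sigma."""
    mean = statistics.fmean(data)
    sd = sigma / len(data) ** 0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


n_draws = 20000
treated_draws = mean_posterior_draws(treated, n_draws)

# Two-step plug-in analysis: the baseline is fixed at its point estimate.
baseline_point = statistics.fmean(control)
plugin_effect = [t - baseline_point for t in treated_draws]

# Joint analysis: the baseline's uncertainty is propagated to the effect.
baseline_draws = mean_posterior_draws(control, n_draws)
joint_effect = [t - b for t, b in zip(treated_draws, baseline_draws)]

plugin_sd = statistics.stdev(plugin_effect)
joint_sd = statistics.stdev(joint_effect)
print(f"plug-in sd: {plugin_sd:.3f}, propagated sd: {joint_sd:.3f}")
```

Under these assumptions the plug-in analysis reports a noticeably narrower spread for the effect than the joint analysis, which is the kind of overconfidence that can make conclusions sensitive to a single analysis step.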

Our objective has been to strike a balance between quantity of data and quality of inference. Although one would be tempted to use as much data as possible, most experimental data come with some degree of measurement error, so summarizing the data and ignoring the noise can (and often does) lead to non-reproducible results. Available data need to be integrated based on knowledge, followed by the application of robust inference methods. The proposed setup allows coherent modeling of vast amounts of observed data, robust inference on parameters of interest, and incorporation of prior knowledge to address biologically relevant questions.