Summer School for
Statistics & Information Technology
Partly sponsored by Microsoft Research Asia
July 4-8, 2005
School of Mathematical Sciences, Peking University
Schedule (invited speakers and topics by date)

July 4
    Jorma Rissanen: Modeling by the MDL Principle
    Jun Liu: Modern Monte Carlo Methods and Their Applications

July 5
    Jun Liu: Modern Monte Carlo Methods and Their Applications
    Baining Guo: Poisson Equation and Gradient Operators for Geometry and Images
    Bin Yu: Embracing Statistical Challenges in the Information Technology Age

July 6
    Mark Hansen and Bin Yu: Information Theory and Statistics

July 7
    John Rice: Statistical Methods and the Analysis of Traffic Data
    Shao-Wei Cheng: Statistical Design of Experiments

July 8
    Terry Speed: Applied statistics, models and algorithms
Abstracts
Statistical Design of Experiments
Shao-Wei Cheng
Institute of Statistical Science
Academia Sinica
Experimentation is one of the most common learning processes that people engage in for gathering knowledge, solving problems, or testing conjectures. However, data collected from an experiment that is not well planned may carry little or no useful information, yet could consume a large amount of resources and time. The design of experiments (DOE) in statistics is an efficient and economical procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions.
Beginning with an example of effective experimentation in scientific and engineering investigations, the lecture will progress through the different types of experimentation, the role of statistics in planning experiments, and the basic principles of DOE, to an overview of various types of experimental designs and some of the most useful basic and advanced data analysis methods. Several actual industrial case studies will be used to illustrate the concepts and to show how DOE can be applied successfully in practice.
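As a minimal illustration of one basic DOE layout (my own sketch, not material from the lecture; the factor names are hypothetical), the following Python snippet enumerates the runs of a two-level full factorial design for three coded factors:

    # Illustrative sketch: enumerate a 2^3 full factorial design with coded levels.
    from itertools import product

    factors = {
        "temperature": [-1, +1],   # low / high (coded levels)
        "pressure":    [-1, +1],
        "catalyst":    [-1, +1],
    }

    # Each run assigns one level to every factor; 2^3 = 8 runs in total.
    runs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
    for i, run in enumerate(runs, 1):
        print(f"run {i}: {run}")

Fractional factorial and other designs discussed in the lecture reduce this run count while preserving the ability to estimate the effects of interest.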
Information Theory and Statistics
Mark Hansen
University of California at Los Angeles
Bin Yu
Statistics Department, UC Berkeley
Information Theory deals with a basic challenge in communication: How do we transmit information efficiently? In addressing that issue, Information Theorists have created a rich mathematical framework to describe communication processes with tools to characterize so-called fundamental limits of data compression and transmission.
What might Statisticians learn from Information Theory? Basic concepts like entropy and Kullback-Leibler divergence have certainly played a role in statistics. But so too have estimation frameworks like the Maximum Entropy principle; novel decompositions like ICA; and even model selection methodologies like AIC and the Principle of Minimum Description Length. In this course we will illustrate how the basic questions and tools of Information Theory relate to statistical practice and theory. In particular, we will use examples from data compression, language modeling, natural image analysis, internet tomography, and neuroscience.
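For reference, the two quantities named above have standard definitions; for discrete distributions p and q on the same alphabet they are

    % Entropy and Kullback-Leibler divergence (standard definitions).
    \[
      H(p) = -\sum_{x} p(x) \log p(x),
      \qquad
      D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.
    \]

The entropy H(p) is the fundamental limit for lossless compression of data drawn from p, and D(p || q) measures the cost of coding with the wrong model q, which is one bridge between coding and statistical estimation.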
Modern Monte Carlo Methods and Their Applications
Jun S. Liu
Department of Statistics
Harvard University
This lecture series aims to provide a systematic illustration of basic Markov chain Monte Carlo (MCMC) techniques and of several new ideas at the research front of MCMC methodology. Topics include the fundamental idea of Metropolis et al. (1953) for constructing a desirable Markov chain, Gibbs sampling strategies, multigrid Monte Carlo, and population-based MCMC methods such as parallel tempering and evolutionary Monte Carlo. Another subject I will describe is the sequential Monte Carlo method recently developed for dealing with dynamic structures, such as nonlinear state-space models, expert systems with sequential observations, and protein structure analysis.
Monte Carlo methods have been crucial in many scientific endeavors, ranging from physics to biochemistry, and have recently become very popular in the statistics community. I will describe a few applications of the MCMC methodology in bioinformatics, nonparametric Bayes computation, model selection, clustering, etc.
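As a concrete illustration of the Metropolis idea mentioned above (my own sketch, not code from the lectures), the snippet below runs a random-walk Metropolis sampler targeting a standard normal density:

    # Illustrative random-walk Metropolis sampler targeting N(0, 1).
    import math, random

    def log_target(x):
        return -0.5 * x * x          # log density of N(0, 1), up to a constant

    def metropolis(n_samples=10000, step=1.0, x0=0.0):
        x, samples = x0, []
        for _ in range(n_samples):
            proposal = x + random.gauss(0.0, step)                # symmetric proposal
            alpha = math.exp(min(0.0, log_target(proposal) - log_target(x)))
            if random.random() < alpha:                           # accept with prob min(1, ratio)
                x = proposal
            samples.append(x)
        return samples

    draws = metropolis()
    print(sum(draws) / len(draws))   # sample mean, should be near 0

The same acceptance rule, applied with cleverer proposals or to populations of chains, underlies the more advanced schemes listed in the abstract.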
Statistical Methods and the Analysis of Traffic Data
John Rice
University of California, Berkeley
A traffic performance measurement system, PeMS, currently functions as a repository for traffic data gathered by thousands of automatic sensors across the state of California. It integrates data collection, processing, and communications infrastructure with data storage and analytical tools.
PeMS is a joint effort of the California Department of Transportation (Caltrans), the University of California, Berkeley, and PATH, the California Partners for Advanced Transit and Highways. The software developed in conjunction with this project is a traffic data collection, processing, and analysis tool to assist traffic engineers in assessing the performance of the freeway system. PeMS extracts information from real-time and historical data and presents it in various forms to assist managers, traffic engineers, planners, freeway users, researchers, and traveler information service providers.
With PeMS, Caltrans managers can instantaneously obtain a uniform and comprehensive assessment of the performance of their freeways. Traffic engineers can base their operational decisions on knowledge of the current state of the freeway network. Planners can determine whether congestion bottlenecks can be alleviated by improving operations or by minor capital improvements. Traffic control equipment (ramp metering and changeable message signs) can be optimally placed and evaluated. In short, PeMS can serve to guide and assess the deployment of intelligent transportation systems. More information about PeMS can be found at http://pems.eecs.berkeley.edu/Public/
In these lectures, I will give an overview of PeMS and will concentrate on presenting and explaining the statistical methodology that is part of it. In particular, I will focus on detecting sensor malfunctions, imputing missing or bad data, estimating velocities, and forecasting travel times on freeway networks. We will see that a variety of modern statistical techniques are applicable in this setting.
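PeMS uses its own, considerably more sophisticated methods; the hypothetical sketch below only illustrates the flavor of the imputation problem, filling a bad or missing lane reading from the neighboring lanes at the same detector station:

    # Hypothetical illustration only (not the PeMS algorithm): impute a missing
    # per-lane reading as the median of the other lanes' readings.
    import statistics

    def impute_missing(readings):
        """readings: list of per-lane values, with None marking a bad/missing sensor."""
        good = [r for r in readings if r is not None]
        fill = statistics.median(good) if good else 0.0
        return [fill if r is None else r for r in readings]

    print(impute_missing([0.12, None, 0.15, 0.11]))   # -> [0.12, 0.12, 0.15, 0.11]

The lectures describe how temporal structure and historical data can be brought to bear on the same problem far more effectively.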
Modeling by the MDL Principle
Jorma Rissanen
Helsinki Institute for Information Technology,
Technical Universities of Tampere and Helsinki, Finland,
and
University of London, Royal Holloway, UK
Data generated by a physical process incorporate both regular features, reflecting the generating machinery, and noise. The objective of modeling is to learn the regularities and, for some applications, to construct a smooth curve as the 'law' representing the restrictions forced by the data-generating machinery, leaving the rest of the data as 'noise'. The MDL (Minimum Description Length) principle seeks to minimize the code length of the data, given a class of models, which provides a formal way to measure three fundamental properties of data: the complexity, the amount of regular features, and the amount of noise. The noise is then defined not as the high-frequency part of the data, as is commonly done, but as the random, incompressible part in light of the models considered.
In this talk I outline the current status of MDL theory, with applications to the construction of universal models and to denoising.
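In its simplest two-part form (a standard textbook formulation, not specific to this talk), the principle selects from the model class the model that minimizes the total code length of the model plus the data encoded with its help:

    % Two-part MDL criterion over a model class M.
    \[
      \hat{M} = \arg\min_{M \in \mathcal{M}} \bigl[ L(M) + L(D \mid M) \bigr]
    \]

Whatever cannot be compressed with the best model in the class is, by this definition, the noise.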
Applied statistics, models and algorithms.
Terry Speed
Department of Statistics and Program in Biostatistics,
University of California, Berkeley, CA 94720-3860
In my view, (applied) statistics is about using data to answer questions, in some context. All three are important: the context, the questions, and, of course, the data. To a first approximation, I think it is fair to say that the vast majority of statistical models and methods in use today by statisticians and more traditional users of statistics had their origins in the middle half of the 20th century, say the period 1925-1975. From the 1980s onwards, with the rise of computers and computerized data collection, the quantities of data available, the scope of statistical analyses, and the people carrying out the analyses have changed the field beyond recognition. It was probably always true that the majority of statistical analysis was carried out by non-statisticians, and that seems even more true today. How many professional statisticians have been involved in companies such as Google? How close are the techniques of Google to Fisherian statistics?
In a controversial 2001 Statistical Science article entitled "Statistical Modeling: The Two Cultures," Leo Breiman argued strongly for what he termed "algorithmic modelling," saying, "If our goal as a field [i.e. Statistics] is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools." You can read the discussion and Leo's rejoinder in the journal.
Rather than continue that debate, I want to take a different tack. I want to discuss the tools, techniques, and modes of thinking from the period of "traditional" statistics (1925-1975), and consider their relevance to the statistics of today. I'm thinking of topics such as a) Looking at your data; b) Assessing data quality; c) Worrying about heterogeneity; d) Design, or lack of it; e) Beginning your analysis; and f) Assessing how well you are doing. For each you should ask: How do you do it? Is it relevant and/or necessary? What are the things you should think about?
I'll talk about these topics in the context of problems I have met, or have been told about, mainly from genomics. Issues that arise include artifacts, data transformation, normalization, and classification.
Embracing Statistical Challenges in the Information Technology Age
Bin Yu
Statistics Department, UC Berkeley
www.stat.berkeley.edu/~binyu
Advances in information technology are making data collection possible in most, if not all, fields of science and engineering, and beyond. Statistics as a scientific discipline is challenged and enriched by the new opportunities resulting from these high-dimensional data sets. Often data reduction or feature selection is the first step towards solving these massive data problems.
In this talk, I will use several research projects to demonstrate how these IT or feature selection challenges are met by finding new applications of traditional statistical thinking and methods and by incorporating compression and computation considerations into statistical estimation. In particular, I will cover cloud detection over the polar region, microarray image compression for statistical analysis, and a new algorithm, BLasso (Boosted Lasso), as a computational alternative for feature selection and for sparse nonparametric regression model fitting.
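For orientation (a standard formulation, not taken from the talk), BLasso is built around the Lasso criterion, which penalizes the residual sum of squares by the l1 norm of the coefficient vector and thereby produces sparse fits:

    % The Lasso criterion whose regularization path BLasso approximates.
    \[
      \hat{\beta}(\lambda) = \arg\min_{\beta}
        \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
    \]

BLasso approximates the solution path over lambda with small boosting-style forward and backward steps rather than by solving each optimization exactly.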
http://www.math.pku.edu.cn:8000/misc/summer_school/