Finding Needles in Haystacks: Tools for Finding Structure in Large Datasets

        Brian D. Ripley
   University of Oxford

Several groups have introduced terms such as KDD (knowledge discovery in databases), Data Mining, Machine Learning and Neural Networks for the challenges of finding structure in large databases. Statisticians have been nowhere near as good at inventing catchy titles or at promoting their wares, but have an outstanding track record of developing useful tools first! We are only now beginning work on microarrays, so the examples will mainly be drawn from biological imaging, particularly functional and structural MRI of human brains.

The talk will give an overview of the statistical toolbox and some of its limitations, including

  • data visualization
  • searching for needles of known form
  • 'cluster analysis' with information about clusters
  • artefact / chance event / real scientific discovery?
  • strata of variation
  • need for robustness and self-calibration.