Statistical inference for data science

Introduction

Before beginning 

This book is designed as a companion to the Statistical Inference Coursera class, part of the Data Science Specialization, a ten-course program offered by three faculty, Jeff Leek, Roger Peng and Brian Caffo, at the Johns Hopkins University Department of Biostatistics.

The videos associated with this book can be watched in full here, though the relevant links to specific videos are placed at the appropriate locations throughout.

Before beginning, we assume that you have a working knowledge of the R programming language. If not, there is a wonderful Coursera class by Roger Peng that can be found here.

The entirety of the book is on GitHub here. Please submit pull requests if you find errata! In addition, the course notes can also be found on GitHub here. While most code is in the book, all of the code for every figure and analysis in the book is in the R markdown files (.Rmd) for the respective lectures.

Finally, we should mention swirl (statistics with interactive R programming). swirl is an intelligent tutoring system developed by Nick Carchedi, with contributions by Sean Kross and Bill and Gina Croft. It offers a way to learn R in R. Download swirl here. There’s a swirl module for this course! Try it out; it’s probably the most effective way to learn.

Statistical inference defined 

Watch this video before beginning.

We’ll define statistical inference as the process of generating conclusions about a population from a noisy sample. Without statistical inference we’re simply living within our data. With statistical inference, we’re trying to generate new knowledge.

Knowledge and parsimony (using the simplest reasonable models to explain complex phenomena) go hand in hand. Probability models will serve as our parsimonious description of the world. The use of probability models as the connection between our data and a population represents the most effective way to obtain inference.

Motivating example: who’s going to win the election? 

In every major election, pollsters would like to know, ahead of the actual election, who’s going to win. Here, the target of estimation (the estimand) is clear: the percentage of people in a particular group (city, state, county, country or other electoral grouping) who will vote for each candidate.

We cannot poll everyone. Even if we could, some polled may change their vote by the time the election occurs. How do we collect a reasonable subset of data and quantify the uncertainty in the process to produce a good guess at who will win?
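As a rough sketch of the idea in R, we can simulate polling a sample of voters from a hypothetical population with a fixed (and, in practice, unknown) level of support, then use the sample proportion and a normal-approximation interval to quantify our uncertainty. The true support value and sample size below are invented for illustration; later chapters develop the theory behind intervals like this one.

```r
set.seed(2015)
true_support <- 0.52   # hypothetical true proportion favoring the candidate (unknown in practice)
n <- 1000              # number of voters polled

# Simulate the number of polled voters who favor the candidate
favor <- rbinom(1, size = n, prob = true_support)

p_hat <- favor / n                        # sample proportion (our estimate)
se <- sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se  # approximate 95% interval

round(c(estimate = p_hat, lower = ci[1], upper = ci[2]), 3)
```

The interval, not the point estimate alone, is the "good guess": it acknowledges that a different random sample of 1,000 voters would have produced a somewhat different proportion.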

Motivating example: predicting the weather

When a weatherman tells you the probability that it will rain tomorrow is 70%, they’re trying to use historical data to predict tomorrow’s weather, and to actually attach a probability to it. That probability refers to a population.

Motivating example: brain activation 

An example that’s very close to the research I do is trying to predict what areas of the brain activate when a person is put in the fMRI scanner. In that case, people are doing a task while in the scanner. For example, they might be tapping their finger. We’d like to compare when they are tapping their finger to when they are not tapping their finger and try to figure out what areas of the brain are associated with the finger tapping.

Summary notes 

These examples illustrate many of the difficulties of trying to use data to create general conclusions about a population. Paramount among our concerns are:

  • Is the sample representative of the population that we’d like to draw inferences about?
  • Are there known and observed, known and unobserved or unknown and unobserved variables that contaminate our conclusions?
  • Is there systematic bias created by missing data or the design or conduct of the study?
  • What randomness exists in the data and how do we use or adjust for it? Here randomness can either be explicit via randomization or random sampling, or implicit as the aggregation of many complex unknown processes.
  • Are we trying to estimate an underlying mechanistic model of the phenomena under study?

Statistical inference requires navigating the set of assumptions and tools and subsequently thinking about how to draw conclusions from data.

Attribution

Brian Caffo (2016), Statistical inference for data science, URL: https://leanpub.com/LittleInferenceBook/read

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US).
