class: center, middle, inverse, title-slide

# Estimation & Inference
## PHS SummR camp
### Christopher Boyer
### 2021-08-27

---

## Plan for today

- Core concepts
  - Populations and parameters
  - Random sampling
  - Law of large numbers
  - Central limit theorem
- Estimation Theory
  - What is an estimator?
  - Properties of estimators
- Inference
  - Hypothesis testing
  - P-values
  - Confidence intervals
- Exercises

---

## Some thoughts on pedagogy

### Why people are nervous about statistics courses

- Fear of saying something wrong
- The math

### My philosophy

- Intuition first, math second
- Ask questions
- Assumptions, assumptions, assumptions
- Acknowledging the history

---

## Statistics

The science of learning from samples.

- Estimation = how can I estimate the population quantity I want given my sample?
- Inference = how certain am I about my estimate of the population quantity?

---

## Parameters and populations

The paradigm:

---

## Sampling

If statistics is the science of learning from samples, it is often built upon the core assumption that sampled observations are independent and identically distributed (i.i.d.).

Random variables `\((X_1, X_2, \ldots, X_N)\)` are *independent* if

`$$X_i \perp\!\!\!\perp X_j \quad \text{ for all } i \neq j$$`

which implies (though the converse does not hold)

`$$\operatorname{Cov}(X_i, X_j) = 0 \quad\text{ for all } i \neq j$$`

Random variables `\((X_1, X_2, \ldots, X_N)\)` are *identically distributed* if

`$$f_{X_i} = f_{X_j} \quad \text{ for all } i \neq j$$`

---

## i.i.d. as an approximation

The i.i.d. assumption is fulfilled by design when observations are *randomly* sampled.
.pull-left[
<img src="inference-slides_files/figure-html/iid-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<table> <thead> <tr> <th style="text-align:right;"> sample </th> <th style="text-align:right;"> X1 </th> <th style="text-align:right;"> X2 </th> <th style="text-align:right;"> X3 </th> <th style="text-align:right;"> X4 </th> <th style="text-align:right;"> X5 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.452 </td> <td style="text-align:right;"> 0.496 </td> <td style="text-align:right;"> -0.265 </td> <td style="text-align:right;"> 0.541 </td> <td style="text-align:right;"> -1.346 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.829 </td> <td style="text-align:right;"> 0.694 </td> <td style="text-align:right;"> 1.648 </td> <td style="text-align:right;"> -0.199 </td> <td style="text-align:right;"> -0.026 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> -0.485 </td> <td style="text-align:right;"> -0.447 </td> <td style="text-align:right;"> 2.295 </td> <td style="text-align:right;"> -0.369 </td> <td style="text-align:right;"> -1.790 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> -0.213 </td> <td style="text-align:right;"> 0.651 </td> <td style="text-align:right;"> -1.789 </td> <td style="text-align:right;"> -2.088 </td> <td style="text-align:right;"> 0.034 </td> </tr> </tbody> </table>
]

However, in practice this is often only an approximation, e.g.:

- Non-random selection in cohort studies, RCTs, and EMR data
- Complex survey data where the units are weighted or stratified

We'll start this semester by generally assuming i.i.d. sampling holds; towards the end we'll cover some methods for correlated data, where it fails. Depending on how the assumption fails, we may actually still be ok.
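
---

## Simulating i.i.d. sampling

As a quick illustration of the setup above (a sketch, not part of the original slides, written in Python rather than R), we can draw i.i.d. samples from `\(N(0, 1)\)` and watch the sample mean settle toward the population mean of 0 as the sample size grows, a preview of the law of large numbers.

```python
import random
import statistics

# Draw i.i.d. samples of increasing size from N(0, 1); the population
# mean is 0, so the sample mean should get closer to 0 as n grows.
random.seed(2021)

for n in [10, 1_000, 100_000]:
    draws = [random.gauss(0, 1) for _ in range(n)]
    print(n, round(statistics.mean(draws), 3))
```

Re-running without the fixed seed gives a different path each time, but the drift toward 0 persists.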
---

## Weak law of large numbers

Let `\(X_1, X_2, \ldots, X_n\)` be i.i.d. random variables with finite variance, i.e. `\(\operatorname{Var}[X] < \infty\)`, then

$$\frac{1}{n} \sum_{i=1}^n X_i \overset{p}{\rightarrow} \mathbb{E}[X], \text{ as } n \rightarrow \infty $$

More generally, for any `\(k\)`th moment, if `\(\mathbb{E}[|X|^{k}] < \infty\)`, then

$$\frac{1}{n} \sum_{i=1}^n X_i^k \overset{p}{\rightarrow} \mathbb{E}[X^k], \text{ as } n \rightarrow \infty $$

A related concept is the **plug-in principle**: to estimate a population feature, simply use the sample analog!

Summary: as the sample size increases towards infinity, sample features approach their population counterparts.

---

## Weak law of large numbers

Example 1: drawing samples from a population following the distribution `\(N(0, 1)\)`.

<img src="inference-slides_files/figure-html/wlln-dist-1-1.png" width="100%" style="display: block; margin: auto;" />

---

## Weak law of large numbers

When the conditions hold... e.g. increasing samples from `\(N(0, 1)\)`.

<img src="inference-slides_files/figure-html/wlln-example-1-1.png" width="100%" style="display: block; margin: auto;" />

---

## Weak law of large numbers

Example 2: drawing samples from a population following the distribution `\(\operatorname{Exp}(0.1)\)`.

<img src="inference-slides_files/figure-html/wlln-dist-2-1.png" width="100%" style="display: block; margin: auto;" />

---

## Weak law of large numbers

When the conditions hold... e.g. increasing samples from `\(\operatorname{Exp}(0.1)\)`.

<img src="inference-slides_files/figure-html/wlln-example-2-1.png" width="100%" style="display: block; margin: auto;" />

---

## Weak law of large numbers

Example 3: drawing samples from a population following the distribution `\(\operatorname{Cauchy}(0, 1)\)`.

<img src="inference-slides_files/figure-html/wlln-dist-3-1.png" width="100%" style="display: block; margin: auto;" />

---

## Weak law of large numbers

When the conditions do not hold... e.g.
increasing samples from `\(\operatorname{Cauchy}(0, 1)\)`.

<img src="inference-slides_files/figure-html/wlln-example-3-1.png" width="100%" style="display: block; margin: auto;" />

---

## The Central Limit Theorem

Let `\(X_1, X_2, \ldots, X_n\)` be i.i.d. random variables with finite mean, `\(\mathbb{E}[X] = \mu\)`, and variance, `\(\operatorname{Var}[X] = \sigma^2\)`, then

$$ \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \overset{d}{\rightarrow} N(0, 1)$$

or, informally,

$$ \bar{X} \overset{\text{approx.}}{\sim} N\left(\mu, \dfrac{\sigma^2}{n}\right)$$

Summary: as the sample size increases towards infinity, the distribution of the sample mean becomes approximately Normal.

The CLT is much broader than just a description of the sample mean: it underlies the delta method, Taylor-series approximations, and plug-in estimators.

---

## The Central Limit Theorem

When the conditions hold... e.g. increasing samples from `\(N(0, 1)\)`.

<img src="inference-slides_files/figure-html/clt-example-1-1.png" width="100%" style="display: block; margin: auto;" />

---

## The Central Limit Theorem

When the conditions hold... e.g. increasing samples from `\(\operatorname{Exp}(0.1)\)`.

<img src="inference-slides_files/figure-html/clt-example-2-1.png" width="100%" style="display: block; margin: auto;" />

---

## The Central Limit Theorem

When the conditions do not hold... e.g. increasing samples from `\(\operatorname{Cauchy}(0, 1)\)`.

<img src="inference-slides_files/figure-html/clt-example-3-1.png" width="100%" style="display: block; margin: auto;" />

---

## The sampling distribution

---

class: inverse center middle

# Estimation

---

## Point estimation

Suppose there is some population feature `\(\theta\)` which we would like to learn. As before, we observe i.i.d. samples `\((X_1, X_2, \ldots, X_n)\)`. Estimation theory is concerned with how best to estimate the value of `\(\theta\)` given the sample.

### Definitions

- **Estimand** `\((\theta)\)`: The population quantity or parameter of interest.
- **Estimator** `\((\widehat{\theta})\)`: A function of the sample used to estimate the estimand.
- **Estimate**: A single realized value of the estimator in a particular sample.

### Examples

---

## Methods of estimation

- Method of moments
- Maximum likelihood
- Bayesian estimation
- Ordinary least squares

---

## What makes a good estimator?

A good estimator should give you the right value "on average".

- Bias
- Consistency

A good estimator should vary as little as possible from the true value.

- Sampling variance
- Mean squared error
- Relative efficiency

A good estimator should have a known or approximate sampling distribution.

- Asymptotic normality

---

## Bias

For estimator `\(\widehat{\theta}\)` the *bias* is the difference between the "average" value of the estimator and the true value `\(\theta\)`, i.e.

`$$\text{Bias}(\widehat{\theta}) = \mathbb{E}[\widehat{\theta}] - \theta$$`

We say an estimator is *unbiased* if `\(\mathbb{E}[\widehat{\theta}] = \theta\)`.

---

## Consistency

An estimator `\(\widehat{\theta}\)` is *consistent* if

`$$\widehat{\theta} \overset{p}{\rightarrow} \theta, \quad \text{ as } n \rightarrow \infty$$`

Alternatively, it suffices to show that both the bias and the variance vanish, i.e.

`$$\lim_{n \rightarrow \infty} \text{Bias}(\widehat{\theta}) = 0 \quad \text{ and } \quad \lim_{n \rightarrow \infty} \operatorname{Var}[\widehat{\theta}] = 0$$`

A biased estimator can be consistent, and an unbiased estimator need not be consistent (e.g. `\(\widehat{\theta} = X_1\)` is unbiased for `\(\mathbb{E}[X]\)` but does not converge), i.e.

`$$\text{Unbiased } \widehat{\theta}\; \not\!\!\!\implies \text{Consistent } \widehat{\theta}, \text{ and } \text{Consistent } \widehat{\theta}\; \not\!\!\!\implies \text{Unbiased } \widehat{\theta}.$$`

---

## Efficiency

For estimator `\(\widehat{\theta}\)` the *mean squared error (MSE)* is

`$$\text{MSE}(\widehat{\theta}) = \mathbb{E}[(\widehat{\theta} - \theta)^2]$$`

which is equivalent to

`$$\text{MSE}(\widehat{\theta}) = \operatorname{Var}[\widehat{\theta}] + \underbrace{(\mathbb{E}[\widehat{\theta}] - \theta)^2}_{[\text{Bias}(\widehat{\theta})]^2}$$`

For any two estimators `\(\widehat{\theta}_1\)` and `\(\widehat{\theta}_2\)`, we say `\(\widehat{\theta}_1\)` is more *efficient* than `\(\widehat{\theta}_2\)` if it has a lower MSE.

---

## Asymptotic normality

An estimator `\(\widehat{\theta}\)` is *asymptotically normal* if

`$$\frac{\widehat{\theta} - \theta}{\sqrt{\operatorname{Var}[\widehat{\theta}]}} \overset{d}{\rightarrow} N(0,1)$$`

---

class: inverse center middle

# Inference

---

## Hypothesis testing

- Focus of basic statistics courses across the country
- Evaluates the compatibility of observed data with some “null” hypothesis `\(H_0\)`, which is the default assumption for the model generating the data
- Two possible options:
  - Reject the null
  - Fail to reject the null (we never “accept” the null)
- Usually the null hypothesis is selected so that rejecting the null ≈ identifying something of scientific importance
- The framework does not distinguish between statistically and practically significant results
  - As the researcher, this part is your job!
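
---

## Hypothesis testing by simulation

To make the reject/fail-to-reject logic concrete, here is a small simulation (a sketch in Python, not part of the original deck): we run many z-tests of `\(H_0: \mu = 0\)` on samples that really are generated under the null, so every rejection is a false alarm.

```python
import random
import statistics

# Simulate many z-tests of H0: mu = 0 on data truly generated under H0
# (N(0, 1) with known sigma = 1). Every rejection here is an error, and
# the rejection fraction should land near the conventional 5% level.
random.seed(42)
n, reps, rejections = 50, 2_000, 0

for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.mean(sample) * n ** 0.5  # z = (xbar - 0) / (sigma / sqrt(n)), sigma = 1
    if abs(z) > 1.96:  # two-sided test at the 5% level
        rejections += 1

print(rejections / reps)  # hovers around 0.05
```

Flip the data-generating mean away from 0 and the rejection fraction becomes the test's power instead.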
---

## Null hypothesis significance testing (NHST)

The elements of a hypothesis test:

- `\(H_0\)`: the null hypothesis
- `\(H_A\)` or `\(H_1\)`: the alternative hypothesis; usually the complement of `\(H_0\)`
- `\(T\)`: a test statistic calculated from the data
- Null distribution: the sampling distribution for `\(T\)` if `\(H_0\)` is true
- Rejection region: the set of values `\(t\)` for which `\(T = t\)` would lead us to reject the null
- Acceptance/non-rejection region: the set of values `\(t\)` for which `\(T = t\)` would lead us to fail to reject the null

Note: An estimator is a statistic, but a statistic doesn't have to be an estimator.

---

## Null hypothesis significance testing (NHST)

We can commit two types of errors:

- Reject when the null is true
- Fail to reject when the null is false

<table> <thead> <tr> <th style="text-align:left;"> Decision </th> <th style="text-align:left;"> `\(H_0\)` true </th> <th style="text-align:left;"> `\(H_1\)` true </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Reject `\(H_{0}\)` </td> <td style="text-align:left;"> Type I error </td> <td style="text-align:left;"> Correct decision </td> </tr> <tr> <td style="text-align:left;"> Don't reject `\(H_{0}\)` </td> <td style="text-align:left;"> Correct decision </td> <td style="text-align:left;"> Type II error </td> </tr> </tbody> </table>

Say we have constructed a test statistic `\(T\)` whose sampling distributions we know

1. under the null
2. under the alternative

Q: How do we know when to reject `\(H_0\)`?

A: It depends on how concerned we are about the two error types!

---

## Significance level of a test

The significance level `\(\alpha\)` of a testing procedure is the probability of rejecting when the null is true.

`$$\alpha = \Pr(\text{reject } H_0 \mid H_0 \text{ true})$$`

Conventionally, `\(\alpha = 0.05\)`, giving a 5% Type I error rate; originally, however, this was supposed to be a decision-specific choice.
A bigger `\(\alpha\)` makes rejection more likely, but also increases the Type I error rate.

---

## Power of a test

The power of a testing procedure is the probability of rejecting when the null is false.

`$$1 - \beta = \Pr(\text{reject } H_0 \mid H_0 \text{ false})$$`

Studies are usually designed to be large enough to have at least 80% power to detect a clinically meaningful difference.

In general, for tests `\(T_1\)` and `\(T_2\)` of the same hypothesis but different power, if we hold the significance level constant, we prefer the one with higher power.

Note: `\(\beta\)` can sometimes refer to the power function `\(\beta(\theta)\)`, which is the probability of rejecting if the true parameter value is `\(\theta\)`, but it also sometimes refers to the Type II error rate. This is confusing!

---

## Example

---

## P-values

Let `\(\widehat{\theta}\)` be an estimator of `\(\theta\)` and let `\(\widehat{\theta}^*\)` be the observed value. Then a one-sided p-value is

`$$p = \Pr\left[\widehat{\theta} \geq \widehat{\theta}^* \mid \theta = \theta_0\right] \quad \text{or} \quad p = \Pr\left[\widehat{\theta} \leq \widehat{\theta}^* \mid \theta = \theta_0\right]$$`

and a two-sided p-value is

`$$p = \Pr\left[|\widehat{\theta} - \theta_0| \geq |\widehat{\theta}^* - \theta_0| \mid \theta = \theta_0\right]$$`

The probability of obtaining an estimate as extreme as or more extreme than the one observed, **when the null is true**.

---

## P-values

### Key take-aways

- The p-value is a statement about the *compatibility* of the observed data with the null hypothesis.
- In a null hypothesis significance testing framework, the p-value can be used to determine *statistical significance* (i.e. if `\(p < \alpha\)`).
- However, the p-value does not have to be used solely for significance testing; indeed, it is a continuous measure of evidence in its own right.
- Elements necessary: an estimator, an estimate, a hypothesis, and a distribution of the estimator under the hypothesis.

### Common misconceptions

- The p-value is not the probability that the null is true.
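
---

## Computing a p-value

The pieces above can be assembled into a few lines of code. This is a hedged sketch in Python (not part of the original deck); the function name and the numbers plugged in are hypothetical.

```python
import math

def normal_p_value(theta_hat, theta_0, se, two_sided=True):
    """Approximate p-value using the standard normal limit distribution."""
    z = (theta_hat - theta_0) / se
    # standard normal CDF via the error function
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    if two_sided:
        return 2 * (1 - phi(abs(z)))
    return 1 - phi(z)  # one-sided (upper-tailed) version

# Hypothetical numbers: estimate 1.2, null value 0, standard error 0.5
print(round(normal_p_value(1.2, 0.0, 0.5), 4))  # z = 2.4, p ≈ 0.016
```

The two-sided branch is exactly `\(2(1 - \Phi(|z|))\)`, the normal-approximation formula discussed in this deck.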
---

## Normal approximation-based p-values

Based on the **central limit theorem**, we know that the limit distribution of many estimators as `\(n \rightarrow \infty\)` is Normal. We can use this fact to calculate an approximate p-value based on the Normal distribution.

For a one-sided test

`$$p = 1 - \Phi\left(\frac{\widehat{\theta}^* - \theta_0}{\sqrt{\operatorname{Var}[\widehat{\theta}]}}\right) \quad \text{ or } \quad p = \Phi\left(\frac{\widehat{\theta}^* - \theta_0}{\sqrt{\operatorname{Var}[\widehat{\theta}]}}\right)$$`

For a two-sided test

`$$p = 2\left(1 - \Phi\left(\frac{|\widehat{\theta}^* - \theta_0|}{\sqrt{\operatorname{Var}[\widehat{\theta}]}}\right)\right)$$`

where `\(\Phi\)` is shorthand for the CDF of the standard normal distribution (i.e. `\(N(0, 1)\)`).

---

## P-value functions

Although p-values are often used for null hypothesis testing, we can use them to test compatibility with any number of hypotheses. In fact, by considering a range of hypothesized values, e.g. `\(\theta_0 = (10, 90)\)`, we can use them to find the value most compatible with the observed data (i.e. estimation)!

<img src="inference-slides_files/figure-html/p-1.png" width="80%" style="display: block; margin: auto;" />

---

## Confidence intervals

A valid 95% confidence interval for parameter `\(\theta\)` is a random interval `\(CI_{95}(\theta) = (X_{lower}, X_{upper})\)` such that

`$$\Pr[\theta \in CI_{95}(\theta)] \geq 0.95$$`

<img src="inference-slides_files/figure-html/ci-1.png" width="80%" style="display: block; margin: auto;" />

---

## Confidence intervals

### Key take-aways

- Loosely speaking, it's an interval that will cover the true value of `\(\theta\)` at least 95% of the time.
- In frequentist statistics the parameter `\(\theta\)` is fixed; it's the *limits of the interval* that are random.
- I've used 95% coverage intervals here, but we can construct intervals for any `\((1 - \alpha)\)` level we want (e.g. 90%, 80%, 50%).
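
The coverage claim can be checked by simulation. This is a sketch in Python with assumed settings (sample size, number of replications), not part of the original slides.

```python
import random
import statistics

# Repeatedly sample n = 200 draws from N(0, 1), form the normal-approximation
# 95% CI for the mean, and record how often it covers the true mean of 0.
random.seed(7)
n, reps, covered = 200, 1_000, 0

for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if m - 1.96 * se <= 0 <= m + 1.96 * se:
        covered += 1

print(covered / reps)  # close to 0.95
```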
### Common misconceptions

- A common mistake is to say that the probability that the parameter is in a *given* interval is 95%.

---

## Normal approximation-based confidence intervals

Based on the **central limit theorem**, we know that the limit distribution of the sample mean as `\(n \rightarrow \infty\)` is Normal. We can use this fact to derive an approximate 95% confidence interval based on the Normal distribution:

`$$CI_{95} = (\widehat{\theta} - Z_{0.975} \cdot \widehat{SE}(\widehat{\theta}),\; \widehat{\theta} + Z_{0.975} \cdot \widehat{SE}(\widehat{\theta}))$$`

where `\(Z_{0.975} \approx 1.96\)` is the 97.5th percentile of the standard normal distribution.

<img src="inference-slides_files/figure-html/ci-normal-1.png" width="80%" style="display: block; margin: auto;" />

---

## Confidence intervals as inversions of hypothesis tests

---

## Causal inference vs. statistical inference

---

class: inverse center middle

# Exercise

---

## References

1. Hernán, M. A. and Robins, J. M. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020.
2. Aronow, P. M. and Miller, B. T. Foundations of Agnostic Statistics. Cambridge University Press, 2019.
3. Angrist, J. D. and Pischke, J. S. Mostly Harmless Econometrics. Princeton University Press, 2008.
4. Casella, G. and Berger, R. L. Statistical Inference. Cengage Learning, 2021.