Wednesday, July 23, 2014

Are Non-Parametric Methods Simpler To Teach Than Parametric Ones?

There is an adage that any title ending in a question mark can be answered "No." However, this proposal might have some weight. I have seen students take an entire course in statistics and still be unable to say what a statistic even is, which is almost a crime. Some of the blame falls on uninterested students cramming to fill a requirement, but part of the problem is how statistics is taught. Students are not taught to see what a statistic is in its natural environment. In fact, with automated statistics programs, it would be much easier to teach non-parametric and data-driven statistics first, then teach parametric statistics with regression analysis in a second course. Instead, it's instantly on to z-scores and t-scores, as if they must be the easiest just because Fisher found them first!

But simplest in theory is not simplest to learn. A better method would be to emphasize probability distributions, data-driven methods, non-parametric tests and statistical software (especially in a "stats for life", non-calculus-based statistics course!). The theorems behind these methods are more complex, but I'm talking about classes in which the central limit theorem isn't proven under the current system anyway. If you aren't proving the theorems, it doesn't matter how difficult they are to prove.

The most important fact in statistics, what makes statistics work at all, is that gathered data can be used to generate a probability distribution. Everything that we learn from statistics is a fact about this distribution. Again, this is not emphasized in stats for life classes, and I don't know why. In my opinion, an entire section of the course, perhaps a month, should be spent on taking data and looking at its distribution, as a pdf curve and a cdf curve. Students should learn what a statistic is by relating it to the geometric pictures they get from gathered or generated data. For instance, the measures of center (means, medians, modes) should be related to the centers actually observed in the data. This should be supplemented with statistical software, perhaps R for advanced students and Excel for less advanced students, to show how statistics are found in practice.

Once the students understand that a statistic is a function of a distribution, we can move on to tests. Several non-parametric methods, such as the Kolmogorov-Smirnov test and cdf-based non-parametric confidence intervals, are easily related to the geometry of the distributions and easily coded. This experience will teach students the point and practice of statistical tests better than z-scores as they are currently taught, because the current approach relates z-scores to a distribution that doesn't obviously come out of data.

Students are unused to thinking about data as a distribution because they are taught only a few distributions, and are given the CLT only as a heavenly cheat that means we don't have to think about how large sets of data will be distributed in order to estimate the mean (which, even by itself, is not language they are used to!). Obviously the normal distribution frequently comes out of data asymptotically, but I've found that students find this too many hurdles to leap at once. If we give up that useful fact and concentrate on teaching the basics (distributions, statistics and statistical tests), statistics will seem less magical to the students.
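
To give a sense of how little code the data-to-distribution-to-test path needs, here is a minimal Python sketch (Python is my choice for illustration; the same thing can be done in R or Excel). It builds an empirical cdf from raw data and computes the two-sample Kolmogorov-Smirnov statistic as the largest vertical gap between two such curves. The function names and the toy samples are made up for the example, and the sketch stops at the statistic; it does not compute a p-value.

```python
from bisect import bisect_right


def ecdf(data):
    """Return the empirical cdf of `data`: F(x) = fraction of points <= x."""
    xs = sorted(data)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted points are <= x
        return bisect_right(xs, x) / n

    return F


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between two ecdfs."""
    Fa, Fb = ecdf(sample_a), ecdf(sample_b)
    # Both curves are step functions that only jump at observed values,
    # so the largest gap is attained at one of those values.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(Fa(x) - Fb(x)) for x in points)


# Hypothetical toy samples, chosen only to make the picture easy to draw by hand.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [3.5, 4.5, 5.5, 6.5]
d = ks_statistic(sample_a, sample_b)
print(d)
```

A student can plot both step functions on the same axes and find the gap with a ruler before ever running the code, which is exactly the geometric connection argued for above.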

A personal note: I remember the first time I taught statistics for the same reason that I remember the first time I drove a car that caught fire - the nightmares. However, the students did react well to some things. I had them run a roulette simulation in Excel to show that, asymptotically, they would lose money. Seeing the data and the trends helped them immensely; they learned a lot and enjoyed it. In retrospect, I realize I could have taught much more like this. In fact, everyone can. I could have had them make a kernel density estimate of the roulette outcomes, so that they would realize it is a pdf. I could have had them find the statistics of that distribution, or run a regression and relate the regression to the statistics. All opportunities wasted.
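
For readers who want to try the roulette exercise outside Excel, here is a rough Python sketch of the same idea. The details are my assumptions for illustration, not a record of the original classroom spreadsheet: an American wheel with 38 pockets, a flat 1-unit bet on red every spin, and a fixed random seed so the run is repeatable.

```python
import random


def simulate_red_bets(n_spins, seed=0):
    """Bet 1 unit on red each spin; return the running average winnings per spin."""
    rng = random.Random(seed)
    total = 0
    running_average = []
    for spin in range(1, n_spins + 1):
        # Assumed American wheel: pockets 0-17 stand for red,
        # 18-35 for black, 36-37 for the two green pockets.
        pocket = rng.randrange(38)
        total += 1 if pocket < 18 else -1
        running_average.append(total / spin)
    return running_average


averages = simulate_red_bets(100_000)
# Expected winnings per spin: 18 winning pockets, 20 losing pockets.
expected = (18 - 20) / 38
print(averages[-1], expected)
```

Plotting `averages` against the spin number shows the trend the students saw: early luck swings wildly, then the curve settles just below zero, near the expected loss of about 5 cents per unit bet.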

To summarize:
1. Statistics stands on three pillars.
2. The first pillar is that data induces a probability distribution.
2a. But in current statistical teaching practice, students are not drilled in instantly turning data into a probability distribution. Since this can be done easily with technology, they should be.
2b. This means that even though non-parametric statistics is in general harder for statisticians, it need not be harder for students to learn. After all, we can control the data they use so that they see mostly well-behaved data in homework.
3. The second pillar is that certain functions of those empirical distributions capture the facts about the distribution that we care about. These functions are called statistics.
3a. But in current statistical teaching practice, statistics are introduced piecemeal (even then, only the measures of center and the variance) and poorly connected to probability distributions. Once a student is used to creating probability distributions from data in the previous step, computing functionals of the distributions is easy. For instance "find the max" is equivalent to computing a mode, "find the center of gravity" is equivalent to finding the mean and "find the middle" is equivalent to finding the median.
3b. This assignment extends the previous one. The previous assignment was "given data, draw the empirical distribution"; this one is "given data, draw the empirical distribution, print it out and make marks on certain spots". It can be supplemented by using technology to find these spots automatically.
4. The final pillar of statistics is statistical testing: do the statistics of the data say what we want?
4a. But in current statistical teaching practice, statistical testing is introduced mainly in a specific case, z- and t-scores. This means that statistical testing is presented piecemeal, with only a few sentences of justification. To understand this choice, one must understand the central limit theorem. Answering z- and t-score questions requires the teacher to hand the student a distribution, which detaches the student from the process of going from data to geometry to conclusion.
4b. This assignment extends the previous one. The previous assignment was "given data, draw the empirical distribution, print it out and make marks on certain spots"; this one is "given data, draw the empirical distribution, print it out and make marks on certain spots, then draw a confidence interval around those spots". It can be supplemented by using technology to find the intervals automatically.
5. Whereas the teaching of z-scores and t-scores encourages students to use those tests without justification, teaching non-parametric statistics in this manner will encourage students to think first about the empirical distribution and which statistics are important, which will give them access to a larger toolkit. In a second course on statistics, it can be further pointed out that for the statistics of greatest interest certain distributional forms can be expected, meaning z-scores and t-scores can come back into the curriculum, this time in their proper place.
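
The "mark the spots, then draw an interval around them" assignments can be sketched in a few lines of Python. This is only one way to do it, under assumptions I am adding for the example: the data set is invented, the mode is taken as the most frequent observed value (sensible for discrete or rounded data), and the interval is the standard distribution-free interval for the median built from order statistics, which relies on the count of observations below the true median being Binomial(n, 1/2) for continuous data.

```python
import math
from collections import Counter


def mean(data):
    """Center of gravity of the data."""
    return sum(data) / len(data)


def median(data):
    """Middle of the sorted data."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2


def mode(data):
    """Most frequent observed value (the peak of the empirical pmf)."""
    return Counter(data).most_common(1)[0][0]


def median_confidence_interval(data, confidence=0.95):
    """Distribution-free interval (x_(k), x_(n-k+1)) for the median.

    Its coverage is 1 - 2 * P(B <= k - 1) with B ~ Binomial(n, 1/2),
    so we take the largest k that keeps coverage >= `confidence`.
    """
    xs = sorted(data)
    n = len(xs)
    alpha = 1 - confidence
    cumulative = 0.0
    k = 0
    for j in range(n):
        cumulative += math.comb(n, j) / 2**n  # now equals P(B <= j)
        if 2 * cumulative > alpha:
            break
        k = j + 1
    if k == 0:
        raise ValueError("sample too small for the requested confidence")
    return xs[k - 1], xs[n - k]


# Hypothetical homework data with one large outlier.
data = [2, 3, 3, 5, 7, 8, 9, 11, 12, 40]
print(mean(data), median(data), mode(data))
print(median_confidence_interval(data))
```

Notice that the interval is read straight off the sorted data: for ten points it runs from the second smallest to the second largest observation, so a student can mark it on a printed plot by counting.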

Alternately, we can tell people that good science is whatever weird randomness you can get in a lab, testing be damned. After all, some prefer theft to honest toil...
