Statistics in plain English: 2014

Tuesday, July 1, 2014

Hypothesis testing is like trying a case in court

In a quantitative research study, the researcher proposes a research hypothesis about something that usually represents a new idea. For example, someone may propose that playing a certain brain game can improve memory performance. Then the researcher collects data to see if s/he can show that this new idea is likely to be true. This hypothesis testing process actually resembles the legal process of trying a case in court.

In a court case, the prosecutor proposes a hypothesis that someone committed a crime. Then s/he gathers evidence to see if s/he can demonstrate beyond a reasonable doubt that the hypothesis is likely to be true.

In both the research study and the court case, the fundamental basis of the process is the same: The default is that the hypothesized event or phenomenon does not exist. A person is presumed innocent until proven guilty (with convincing evidence); the brain game is presumed to do nothing for memory performance until proven effective.

In the court case, the default (of presumed innocence) is clear and known to every citizen. In a research study, we need to spell out the default for the reader, which is called the "null hypothesis" because it means that the hypothesis proposed is not valid ("null"). In contrast to the "null hypothesis," the proposed hypothesis is called the "alternative hypothesis" or "research hypothesis."

In the research study, the researcher has to provide evidence that is convincing enough to reject the null hypothesis, just as the prosecutor needs to provide evidence that is convincing enough to reject the presumed innocence of the accused individual. But when is the evidence "convincing" enough? In the court case, it's a judgment call on the part of the jury, while in the research study we rely on statistics.

The convention in social science research is to consider the evidence "convincing" when a result like the one observed would be very unlikely if the null hypothesis were true. Specifically, there should be at most a 5% chance of seeing an effect this large purely as a random finding when nothing is really going on. (Strictly speaking, this is not the same as a 95% chance that the research hypothesis is true, although it is often described that way.) For example, suppose we show that people who have played the brain game for 6 months perform better on average than those (with similar baseline memory functions) who have not played the brain game. A statistical comparison of the memory scores between the two groups can tell us how likely it is that a difference this large would happen by chance alone when the brain game really has no effect of any kind, which would mean the default (null hypothesis or presumed innocence) stands.

The probability of getting a difference in memory scores at least this large from chance alone, assuming the brain game has no effect, is the p value obtained in the statistical test. If p is lower than the threshold of 5%, then we are willing to reject the default or null hypothesis. What we are saying is the following:
1. We see that the two groups of people have different scores on average.
2. We perform a statistical test to see how likely a difference this large would be as a result of random chance if the brain game really didn't have any effect.
3. We see that the probability in #2 is lower than 5%, which is the threshold that we are willing to accept. In other words, the memory scores we obtained would be quite unlikely if the brain game had no effect.
4. Since our data would be so unlikely under the null hypothesis, we reject the null hypothesis that the brain game has no effect. Our evidence therefore "suggests" that the brain game is likely to have an effect on memory performance.
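
For readers who like to see the arithmetic, here is a minimal sketch of the four steps in Python. It is only an illustration, not the test any particular study would use: the memory scores are made up, and the function name `permutation_p_value` is my own. It uses a permutation test, which reshuffles the scores into two groups in every possible way and counts how often chance alone produces a difference at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    at_least_as_large = 0
    n_splits = 0
    # Try every way of relabeling the pooled scores into two groups.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_large += 1
        n_splits += 1
    return at_least_as_large / n_splits


# Hypothetical memory scores, invented for illustration only.
players = [78, 85, 82, 90, 88, 84]
non_players = [72, 75, 80, 70, 77, 74]

alpha = 0.05  # the conventional threshold
p = permutation_p_value(players, non_players)

print(f"p = {p:.4f}")
if p < alpha:
    print("Reject the null hypothesis: the difference is unlikely to be chance alone.")
else:
    print("Do not reject the null hypothesis.")
```

With these made-up scores, very few of the possible reshufflings produce a gap as large as the observed one, so p falls below the 5% threshold and the null hypothesis is rejected.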

The chosen threshold, called "alpha level", will determine how stringent your statistical test is. Setting it lower than .05 (or 5%) makes it harder to reject the null hypothesis and support your research hypothesis. But if you are successful in rejecting the null hypothesis with a lower alpha level, the research result would of course be more convincing. Setting the alpha level above .05 is usually not recommended.

Naming a variable

The name of a variable is like the last name of a family because it applies to all family members who are distinguishable by their first names.

Example 1:
The family of "gender" has two members, "male" gender and "female" gender.

Example 2:
The Likert scale family of "degree of agreement" has five members: "strongly disagree", "disagree", "neutral", "agree", and "strongly agree."

Example 3:
The family of "SAT score" has many, many possible members, each represented by a specific test score.

But sometimes it's not very obvious what the "last name" of the variable family is. For example, if we want to see if retaining a child in a grade enables her to make more progress at the end of a year, we may compare "children retained in a grade" with "children with the same level of test scores but promoted to the next grade anyway". What would be the variable to represent the two conditions of retention and no-retention?

Here's another example. If we would like to see whether playing violent video games affects children's playground behavior, we might compare the behavior of "kids playing violent video games" and "those playing non-violent video games". What would be the variable name to capture these two types of video games?

To name the variable (family), we want a label that is neutral so that each family member (variable level) can fit the label. For the grade retention example, we can name the variable "grade retention" or "grade retention status", and the two groups of kids vary along this factor. For the video game example, we can name the variable "violence in video game" so that one group experiences the presence of violence but the other group experiences the absence of violence in video games.

If you think of variables and their specific levels or values as families and family members, you can avoid the common mistake of describing the specific levels or values as "variables," such as the error of calling violent games and non-violent games two variables when they belong to the same variable (family).
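
The family idea maps neatly onto how data is usually laid out. Here is a small sketch in Python, with made-up records and a helper name (`check_record`) that I invented for illustration: each variable has one name and a list of levels, and each child gets exactly one level per variable.

```python
# Each variable (the "family") has one name and several levels (the "members").
variables = {
    "gender": ["male", "female"],
    "grade retention status": ["retained", "promoted"],
    "violence in video game": ["violent", "non-violent"],
}

# Hypothetical records: every child has ONE value for each variable.
children = [
    {"id": 1, "grade retention status": "retained", "violence in video game": "violent"},
    {"id": 2, "grade retention status": "promoted", "violence in video game": "non-violent"},
    {"id": 3, "grade retention status": "promoted", "violence in video game": "violent"},
]


def check_record(record, variables):
    """Return the names of variables whose recorded value is not a known level."""
    return [
        name
        for name, levels in variables.items()
        if name in record and record[name] not in levels
    ]


for child in children:
    print(child["id"], check_record(child, variables))
```

Notice that "violent" and "non-violent" never show up as separate columns. They are two values that live in the same column, "violence in video game."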

Tuesday, May 13, 2014

A VARIABLE must vary!

In the simplest definition, a "variable" is something that varies or changes. It is the opposite of a constant, which does not change. In statistics, a variable is something we measure, and we end up with different measurements when we measure different people or measure the same people at different times.

"My weight first thing in the morning" is a variable that changes at least slightly pretty much every single day. I step on the scale first thing in the morning and every day I get a somewhat different reading, like 130.0, 129.7, 128.9, 131.3, etc.

"Ranking on air quality" is a variable that varies from city to city according to an environmental agency that tests the air quality of the major cities in the states and rank-orders them. Each of the cities listed has a different ranking.

"Favorite TV show" is a variable that changes from person to person. I can ask all the neighbors on my block and I will end up with many different TV shows being the "favorite TV show".

So the name of the variable has to be "generic" enough so that it can take on different "values" or "categories" if we collect data by taking a number of measurements on that variable. We do not refer to one particular measurement or value and call it a "variable." For example, the variable is "favorite TV show" instead of "CSI", "ranking on air quality" instead of #5 rank for a certain city, "my weight in the morning" instead of "130.0". If we get the hang of this, I should not see any more of the error of describing "two variables" as male and female, or as high and low rankings. Male and female are the different categories within the same variable "gender"; all rankings are the possible values within the same variable of "ranking on air quality."

A variable like "favorite TV show" is categorical or nominal, because it contains several categories with different names (nominal) and people can be sorted into those categories.

A variable like "ranking on air quality" is ordinal, because the different rankings are in an order. Rank 1 is higher than rank 5, which is higher than rank 15.

A variable like "my weight first thing in the morning" is continuous, because the scale used to measure my weight is a continuous one where I can get a measurement anywhere on the scale, not restricted to certain categories or ranks.

The variable type has important implications in statistics because we obviously can do more complicated mathematical operations on a continuous variable, compared to ordinal or nominal variables. Imagine taking an average across the different TV shows or figuring out the air quality difference by subtracting one city's ranking from another city's ranking! Don't ever attempt these tasks at home or at school!!
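
As a rough sketch of what this means in practice, here is a short Python example with made-up data (only the weights and "CSI" come from the examples above; the other show names and the rankings are placeholders). It picks a summary that makes sense for each variable type: the most common category for a nominal variable, the middle rank for an ordinal one, and the average for a continuous one.

```python
from statistics import mean, median, mode

# Nominal: names only, so counting categories is all we can do.
favorite_tv_show = ["CSI", "Show B", "CSI", "Show C", "CSI"]

# Ordinal: order matters, but gaps between ranks are not meaningful.
air_quality_rank = [1, 5, 15, 8, 3]

# Continuous: ordinary arithmetic such as averaging makes sense.
morning_weight = [130.0, 129.7, 128.9, 131.3]

most_common_show = mode(favorite_tv_show)   # most frequent category
middle_rank = median(air_quality_rank)      # middle value once sorted
average_weight = mean(morning_weight)       # arithmetic mean

print("Most common favorite show:", most_common_show)
print("Median air quality rank:", middle_rank)
print("Average morning weight:", round(average_weight, 2))
```

There is no line that averages the TV shows or subtracts one city's ranking from another's, and that is on purpose.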

Monday, May 12, 2014

What do you mean by SAMPLE POPULATION???

It drives me nuts when I see students discuss "sample population" in their dissertations. And different people seem to define this nonsense term differently, too. Sometimes they describe the actual sample for their study, but other times they describe the population from which the sample was drawn. But the two words mean such different (and somewhat opposite) things that they cannot form a compound word.

If you go to Costco and try a sample of the pan-fried brand X dumplings, it's called a "sample" because you are not able (or willing or allowed) to eat all the brand X dumplings in the store. So a sample is a selected representative of a large group of something. If you like the sample of the pan-fried dumplings, you then assume that all the dumplings of the same brand sold in the store are equally good and you decide to buy a pack. The whole inventory of brand X dumplings in the store is the "population" from which your "sample" comes. The sample naturally cannot be called a population or vice versa.

An interesting sampling problem is that your sample may have been good for a number of reasons, not just because the dumplings are made by company X. You may think it's good because...

a) You had to fight the hungry mob surrounding the stand just to get a sample.

b) You happened to be very hungry and everything just tasted better than usual.

c) The sample was cooked with a nice table-top grill and just the right amount of one particular type of oil.

d) The store staff has become an expert in cooking dumplings.

None of these conditions can be replicated when you eat the brand X dumplings you take home. As a result, your sample ends up not quite representing the population of brand X dumplings sold by Costco.

Here's another potential sampling problem. Even though you like the sample at Costco, you decide to buy it elsewhere because the Costco package has enough dumplings for a whole football team and there are only four of you in the family. When you find brand X dumplings in a supermarket, however, the packaging looks different even though the content seems to be the same. But you assume that your sample should be similar enough to all the same dumplings made by company X and you purchase a bag. You may be disappointed when you eat them at home because it turns out that brand X provides special batches of these dumplings through a different packaging process for Costco. And somehow those procedural differences made the Costco version of the dumplings better tasting than the supermarket version.

These considerations are the same for a research project. Let's say I am interested in whether using a tablet in the fourth-grade classroom can enhance the learning outcome of reading comprehension. And I find out that one of the school X fourth-grade teachers allows all their students to use their tablets (from home or provided by the school) to read during a daily reading period while another teacher in the same grade in the same school does not allow the use of tablets during the same reading period. So I compare the tablet group and the no-tablet group in terms of the progress they made during the school year, based on a reading test administered before and after the school year. The fourth-grade students in these two teachers' classrooms in this particular school constitute my "sample" for the research project, but what I would really like to find out is whether using a tablet in the classroom would be beneficial to all fourth graders in the U.S., which would be my "population." Alternatively, I might even be interested in all fourth graders in the world as my "population." So the question is whether or to what extent my sample of fourth graders can represent my target population. Many factors then become relevant: the particular demographics of these students, the particular teachers' experience, etc.

If school X has many fourth-grade classrooms and teachers, I could say that my target population is simply the fourth graders in this particular school so that my sample of students would be a pretty good representative of the population. But then what is the research value of finding out whether using a tablet is beneficial only for this particular school or not?
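
To make the sample-versus-population gap concrete, here is a small Python sketch. Everything in it is invented for illustration: the three schools, the reading gains, and the seed. It compares a sample taken entirely from school X with a random sample drawn from all the students, and checks how far each one lands from the average of the whole (pretend) population.

```python
import random
from statistics import mean

# Hypothetical reading-score gains for fourth graders in three schools.
gains_by_school = {
    "school X": [12, 14, 11, 13, 15, 12, 14, 13],
    "school Y": [5, 7, 6, 4, 6, 5, 7, 6],
    "school Z": [8, 9, 10, 9, 8, 10, 9, 9],
}

# The "population" here is every student in all three schools.
population = [gain for gains in gains_by_school.values() for gain in gains]

# Sample 1: only the students I happen to have access to (school X).
convenience_sample = gains_by_school["school X"]

# Sample 2: the same number of students drawn at random from everyone.
rng = random.Random(2014)  # fixed seed so the draw is repeatable
random_sample = rng.sample(population, k=len(convenience_sample))

population_mean = mean(population)
print("Population mean gain:  ", population_mean)
print("School X sample mean:  ", mean(convenience_sample))
print("Random sample mean:    ", mean(random_sample))
```

In this made-up data, school X happens to be the strongest school, so a sample drawn only from it overstates the average gain. That is the dumpling problem all over again.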