Statistics in plain English: May 2014

In the simplest definition, a "variable" is something that varies or changes. It is the opposite of a constant, which does not change. In statistics, a variable is something we measure, and we end up with different measurements when we measure different people or measure the same people at different times.

"My weight first thing in the morning" is a variable that changes at least slightly pretty much every single day. I step on the scale first thing in the morning and everyday I get a somewhat different readings like 130.0, 129.7, 128.9, 131,3...etc.

"Ranking on air quality" is a variable that varies from city to city according to an environmental agency that test the air quality of the major cities in the states and rank-order them. Each of the city listed has a different ranking.

"Favorite TV show" is a variable that changes from person to person. I can ask all the neighbors on my block and I will end up with many different TV shows being the "favorite TV show".

So the name of the variable has to be "generic" enough that it can take on different "values" or "categories" when we collect data by taking a number of measurements on that variable. We do not refer to one particular measurement or value and call it a "variable." For example, the variable is "favorite TV show" instead of "CSI", "ranking on air quality" instead of the #5 rank for a certain city, "my weight in the morning" instead of "130.0". If we get the hang of this, I should not see any more of the error of describing "two variables" as male and female, or as high and low rankings. Male and female are the different categories within the same variable "gender"; all rankings are the possible values within the same variable of "ranking on air quality."

A variable like "favorite TV show" is categorical, or nominal, because it contains several categories with different names (hence "nominal"), and people can be sorted into those categories.

A variable like "ranking on air quality" is ordinal because the different rankings are in an order. Rank 1 is higher than rank 5, which is higher than rank 15.

A variable like "my weight first thing in the morning" is continuous because the scale used to measure my weight is a continuous one: I can get a measurement anywhere on the scale, not limited to certain categories or ranks.

The variable type has important implications in statistics because we can do more complicated mathematical operations on a continuous variable than on an ordinal or nominal one. Imagine taking an average across the different TV shows, or figuring out the air quality difference by subtracting one city's ranking from another city's ranking! Don't ever attempt these tasks at home or at school!!
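For readers who like to see this in code, here is a minimal Python sketch of the idea that the variable type decides which summary makes sense. The weights and ranks are the ones mentioned above; the TV shows other than "CSI" and the helper name `sensible_summary` are made up for illustration, and picking the mode, median, and mean is just one common convention, not the only defensible one.

```python
from statistics import mean, median, mode

# Values for three variables. Only "CSI", the ranks, and the weights come
# from the text; the other TV shows are hypothetical.
favorite_tv_show = ["CSI", "Jeopardy", "CSI", "Nova"]   # nominal
air_quality_rank = [1, 5, 15]                            # ordinal
morning_weight = [130.0, 129.7, 128.9, 131.3]            # continuous


def sensible_summary(values, variable_type):
    """Return a summary that makes sense for the given variable type."""
    if variable_type == "nominal":
        return mode(values)      # most common category; no arithmetic involved
    if variable_type == "ordinal":
        return median(values)    # middle rank; uses order, not distances
    if variable_type == "continuous":
        return mean(values)      # averaging is meaningful here
    raise ValueError(f"unknown variable type: {variable_type}")


show_summary = sensible_summary(favorite_tv_show, "nominal")
rank_summary = sensible_summary(air_quality_rank, "ordinal")
weight_summary = sensible_summary(morning_weight, "continuous")
```

Notice that nothing in the nominal branch adds or subtracts anything: counting which category shows up most often is about all we can do with TV shows.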

It drives me nuts when I see students discuss "sample population" in their dissertations. And different people seem to define this nonsense term differently, too. Sometimes they describe the actual sample for their study, but other times they describe the population from which the sample was drawn. But the two words mean such different (and somewhat opposite) things that they cannot form a compound word.

If you go to Costco and try a sample of the pan-fried brand X dumplings, it's called a "sample" because you are not able (or willing or allowed) to eat all the brand X dumplings in the store. So a sample is a selected representative of a larger group of something. If you like the sample of the pan-fried dumplings, you then assume that all the dumplings of the same brand sold in the store are equally good and you decide to buy a pack. The whole inventory of brand X dumplings in the store is the "population" from which your "sample" comes. The sample naturally cannot be called a population or vice versa.
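The same distinction can be sketched in a few lines of Python. Everything here is an assumption for illustration: the taste scores are simulated, and the inventory size of 5000 and the sample size of 5 are arbitrary.

```python
import random

rng = random.Random(2014)  # fixed seed so the sketch is repeatable

# The population: a hypothetical taste score for every brand X dumpling
# in the store (simulated numbers, not real data).
population = [rng.gauss(7.0, 1.0) for _ in range(5000)]

# The sample: the few dumplings you actually get to taste at the stand.
sample = rng.sample(population, 5)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
```

The sample is drawn from the population and is far smaller than it. We use `sample_mean` to make a guess about `population_mean`, which in real life we never get to compute.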

An interesting sampling problem is that your sample may have tasted good for a number of reasons, not just because the dumplings are made by company X. You may think it's good because...

a) You had to fight the hungry mob surrounding the stand just to get a sample.

b) You happened to be very hungry and everything just tasted better than usual.

c) The sample was cooked with a nice table-top grill and just the right amount of one particular type of oil.

d) The store staff have become experts in cooking dumplings.

None of these conditions can be replicated when you eat the brand X dumplings you take home. As a result, your sample ends up not quite representing the population of brand X dumplings sold by Costco.

Here's another potential sampling problem. Even though you like the sample at Costco, you decide to buy the dumplings elsewhere because the Costco package has enough for a whole football team and there are only four of you in the family. When you find brand X dumplings in a supermarket, however, the packaging looks different even though the content seems to be the same. But you assume that your sample should be similar enough to all the same dumplings made by company X, and you purchase a bag. You may be disappointed when you eat them at home because it turns out that brand X produces special batches of these dumplings, through a different packaging process, for Costco. And somehow those procedural differences made the Costco version of the dumplings taste better than the supermarket version.
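Here is a rough Python sketch of that second problem: a sample taken only at the Costco stand gets used to judge everything company X sells. The taste scores, the batch averages of 8.0 and 6.0, and the batch sizes are all invented for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical taste scores for two batches of brand X dumplings.
costco_batch = [rng.gauss(8.0, 0.5) for _ in range(1000)]
supermarket_batch = [rng.gauss(6.0, 0.5) for _ in range(1000)]
all_brand_x = costco_batch + supermarket_batch


def average(scores):
    """Plain arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


# A sample taken only at the Costco stand...
costco_only_sample = rng.sample(costco_batch, 30)
# ...versus a sample drawn from everything company X sells.
mixed_sample = rng.sample(all_brand_x, 30)

costco_only_avg = average(costco_only_sample)
mixed_avg = average(mixed_sample)
overall_avg = average(all_brand_x)
```

Under these made-up numbers, `costco_only_avg` lands well above `overall_avg`. The Costco sample describes the Costco batch just fine; it is only misleading when we stretch it to cover a population it was never drawn from.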

These considerations are the same for a research project. Let's say I am interested in whether using a tablet in the fourth-grade classroom can enhance the learning outcome of reading comprehension. And I find out that one of the fourth-grade teachers at school X allows all of their students to use tablets (brought from home or provided by the school) to read during a daily reading period, while another teacher in the same grade at the same school does not allow tablets during the same reading period. So I compare the tablet group and the no-tablet group in terms of the progress they make during the school year, based on a reading test administered before and after the school year. The fourth-grade students in these two teachers' classrooms in this particular school constitute my "sample" for the research project, but what I would really like to find out is whether using tablets in the classroom would be beneficial to all fourth graders in the U.S., which would be my "population." Alternatively, I might even be interested in all fourth graders in the world as my "population." So the question is whether, or to what extent, my sample of fourth graders can represent my target population. Many factors then become relevant: the particular demographics of these students, the particular teachers' experience, etc.
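To make the comparison concrete, here is a small Python sketch of the "progress" calculation for the two groups. The scores are entirely hypothetical (four students per classroom, invented numbers), and comparing mean gains is just one simple way to compare the groups, not necessarily what a real study would do.

```python
from statistics import mean

# Hypothetical (pre-test, post-test) reading scores, one pair per student.
tablet_group = [(52, 61), (48, 55), (60, 66), (55, 63)]
no_tablet_group = [(50, 56), (47, 52), (62, 67), (58, 62)]


def mean_gain(scores):
    """Average of (post - pre) across the students in one group."""
    return mean(post - pre for pre, post in scores)


tablet_gain = mean_gain(tablet_group)        # 7.5 with these numbers
no_tablet_gain = mean_gain(no_tablet_group)  # 5 with these numbers
difference = tablet_gain - no_tablet_gain    # 2.5 with these numbers
```

Whatever `difference` turns out to be, it describes only the students in these two classrooms. Whether it says anything about the population is the sampling question, and no amount of arithmetic on the sample can answer that by itself.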

If school X has many fourth-grade classrooms and teachers, I could say that my target population is simply the fourth graders in this particular school, so that my sample of students would represent the population pretty well. But then what is the research value of finding out whether using tablets is beneficial only for this particular school?

Tuesday, May 13, 2014

A VARIABLE must vary!

Monday, May 12, 2014

What do you mean by SAMPLE POPULATION???