Overview. All of the material in this supplemental online text is purely optional and is meant to enhance the textbook. Use whatever aspects of it help to deepen your understanding of the material and ignore the rest. It assumes you have already read the relevant chapter. I will be adding more material as the semester progresses.
In teaching about statistics, the primary tool scientists use to make sense of their data, it is important to me that I elucidate the differences between the scientific approach to knowledge and the religious approach to knowledge. It is also important to me that I point out that both approaches have value, depending upon the context in which they are used. I would like to take advantage of the opportunity that creating this supplemental material gives me to expand this topic until it encompasses something that is important to me and that I would like to share. This involves talking at a conceptual level that is far beyond where the rest of this supplemental material will reside. After this chapter I will settle down to simply giving additional examples of the story problems covered in later chapters of the text.
In the first chapter of the text I describe the fundamental attributes of the scientific approach to knowledge. These attributes make the scientific approach distinctly different from other approaches, including religion. What I would like to add to that description is that while the scientific approach and the religious approach have important, fundamental, differences, they both arose from the same culture and at a deep level they share some basic assumptions about the nature of reality. These assumptions are rarely brought to light to be examined because, obviously, they are assumed to be true.
The dictionary defines a worldview as a culture's set of concepts and beliefs about the nature of reality. Both science and Western religion reside within the modern Western worldview. There are other worldviews on the planet, and there have been other worldviews in the Western world in earlier times. If we know only one worldview we tend to think it is the only one that exists and that all other worldviews are simply variations of our own. In the case of indigenous worldviews we tend to think of them as primitive versions of our Western worldview, like our own worldview but with more superstitions and less knowledge.
For the past 24 years I have been immersing myself in the worldview of the indigenous people who live in the high Andes of Peru. Their worldview, the Andean Cosmovision, supports a way of perceiving and interacting with reality that is fundamentally different from that of the Western worldview. It is a way of understanding reality that has not been influenced by the Bible, or by the classic Greek philosophers, or by Descartes' division of reality into separate mind and matter, or by the scientific revolution. This worldview simply cannot be understood through the spectacles of science or Western religion, which we in the West assume can handle anything. I speak of the Andean Cosmovision simply because it is the only worldview with which I am intimately familiar other than my own modern, Western, worldview, but I believe the same holds true for other worldviews as well.
So my point is this. In the textbook I have described the important differences between science and Western religion. They are two distinct approaches to knowledge. They both arose, however, from the same worldview and share some basic assumptions about the nature of reality. When we step out of the Western worldview, and into other worldviews we can find other ways of perceiving and interacting with reality that cannot be encompassed, described, or understood from within the Western worldview. If we don't recognize this, then when we consider other worldviews, rather than looking through a window at a new way of perceiving reality we are instead simply looking into a mirror.
If you would like to know more about the Andean Cosmovision you can visit my blog at Salka Wind Blog where I have published a great deal of information stemming from my research in Peru.
Identify each of the following measurement scales. Click "See Answer" to see the answer (did I need to say that?).
This is a cardinal measure as it directly measures a quantity (number of students).
This is a nominal scale as the numbers reflect qualitative (categorical) differences rather than quantities.
This is an ordinal scale. The scores reflect a change in quantity in a specific direction (the greater the score the more likely to purchase a phone). The sizes of the steps are not necessarily equal (i.e. the difference between 'unlikely' and 'somewhat likely' may not be the same size as the difference between 'somewhat likely' and 'very likely').
This is a rank scale. While the size of a university is a cardinal scale (see above), what is being measured here is how the university ranks in comparison to other universities. The score for the largest university would be "1", which tells us how it compares to others but does not actually tell us how many students there are. While rank scales are a subset of ordinal scales, what makes a rank scale different is that the score is dependent upon how the subject compares to others in some group. While the largest university in Utah would receive a rank score of "1" if we compared it to others in Utah, it would get a different rank score if we compared it to other universities in the whole country.
This is an ordinal scale. The scores reflect a change in quantity in a specific direction (the greater the score the lower the level of satisfaction) and the sizes of the steps are not necessarily equal.
This is a rank scale. Your score reflects your ranking in birth order in your family.
This is a nominal scale.
This is a cardinal scale.
You sample from a population and obtain the following scores. Y = 8, 7, 5, 7, 6, 4
Descriptive Statistics. Compute the following. Be sure to use the correct symbols.
Inferential Statistics. Compute the following. Be sure to use the correct symbols.
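The specific quantities to compute are not listed in this excerpt. As a hedged sketch, assuming they include the mean, the sum of squares (SS), the variance, and the standard deviation, and assuming the usual convention of dividing SS by N for descriptive statistics and by N - 1 for inferential statistics, the arithmetic for the scores above could be checked in Python like this (the variable names are my own):

```python
import math

scores = [8, 7, 5, 7, 6, 4]            # the Y scores from the problem
n = len(scores)

sum_y = sum(scores)                     # sum of Y
sum_y_sq = sum(y * y for y in scores)   # sum of Y squared
mean = sum_y / n

# Computational formula for the sum of squares: SS = sum(Y^2) - (sum(Y))^2 / N
ss = sum_y_sq - sum_y ** 2 / n

# Assumption: descriptive statistics divide SS by N ...
var_descriptive = ss / n
sd_descriptive = math.sqrt(var_descriptive)

# ... while inferential statistics divide SS by N - 1.
var_inferential = ss / (n - 1)
sd_inferential = math.sqrt(var_inferential)

print(round(mean, 3), round(ss, 3))
print(round(var_descriptive, 3), round(sd_descriptive, 3))
print(round(var_inferential, 3), round(sd_inferential, 3))
```

The only difference between the two sets of results is the divisor, which is why the inferential variance and standard deviation come out a little larger.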
Eyeballing the standard deviation.
Finding proportions under the normal curve.
You have a population that is normally distributed with a mean of 80 (i.e. μ = 80) and a standard deviation of 7 (i.e. σ = 7). What proportion of the scores will be 76 or greater (i.e. Y ≥ 76)?
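As a rough check on this kind of problem, the sketch below converts Y = 76 to a z score and then asks for the proportion of the standard normal curve at or above that z. Python's statistics.NormalDist stands in for the Normal Tool here; that substitution is mine, not part of the course software.

```python
from statistics import NormalDist

mu = 80       # population mean
sigma = 7     # population standard deviation
y = 76        # the score of interest

# z tells us how many standard deviations Y is from the mean.
z = (y - mu) / sigma

# Proportion of the curve at or above z (the area to the right of z).
proportion_at_or_above = 1 - NormalDist().cdf(z)

print(round(z, 2), round(proportion_at_or_above, 3))
```

Because 76 is below the mean, z is negative and the answer should be more than half of the curve, roughly 0.716.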
Draw a normal curve, label it "Original Population (Y Scores)", mark the mean and standard deviation on the curve, mark (approximately) where Y=71 and Y=89 would fall, and then shade in the areas in question (the areas where Y is less than or equal to 71 and Y is greater than or equal to 89).
The only way to determine the proportion of scores that fall in the shaded areas is by changing the Y values to z values so that we can use the Normal Tool. The formula below is used to change Y=71 and Y=89 into z scores.
Those z values have been placed on the curve below.
Now use the Normal Tool to find the proportion of the curve that falls in the shaded areas.
The Normal Tool tells us that 0.5485 proportion of the scores fall in the shaded areas, which is our answer.
Note: When we want to determine what proportion of the sample means fall within certain values, we turn to the sampling distribution of the mean (SDM).
Determine the sampling distribution of the mean (SDM).
Draw a normal curve, label it "SDM for N=10", mark the mean and standard deviation on the curve. This is a graph of sample means, mark (approximately) where M=71 and M=89 would fall, and then shade in the areas in question (the areas where M is less than or equal to 71 and M is greater than or equal to 89).
The only way to determine the proportion of means that fall in the shaded areas is by changing the M values to z values so that we can use the Normal Tool. The formula below is used to change M=71 and M=89 into z scores. Note that what has changed from the computation of z above is that the denominator is different as the standard deviation of the curve is different.
We now have the following curve and we can again use the Normal Tool to find what proportion of the curve is in the shaded regions. The only thing that has changed in using the Normal Tool is that we now have different values for z.
The Normal Tool tells us that .0588 proportion of the sample means will be 71 or less or 89 or greater (i.e. 9 or more away from the population mean).
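The population values for this second problem are not restated in this excerpt. The sketch below assumes μ = 80 and σ = 15. That is an assumption on my part, chosen because it reproduces the 0.5485 reported for the individual scores; N = 10 comes from the "SDM for N=10" label. It uses Python's statistics.NormalDist in place of the Normal Tool, and the function name is my own.

```python
import math
from statistics import NormalDist

MU = 80        # assumed population mean (the midpoint of 71 and 89)
SIGMA = 15     # assumed population standard deviation, not stated in the text
N = 10         # sample size from the "SDM for N=10" label

def two_tail_proportion(lower, upper, mean, sd):
    """Proportion of a normal curve at or below `lower` plus at or above `upper`."""
    z_lower = (lower - mean) / sd
    z_upper = (upper - mean) / sd
    std_normal = NormalDist()
    return std_normal.cdf(z_lower) + (1 - std_normal.cdf(z_upper))

# Individual scores: the standard deviation of the curve is sigma.
p_scores = two_tail_proportion(71, 89, MU, SIGMA)

# Sample means: the standard deviation of the curve is sigma / sqrt(N).
sigma_m = SIGMA / math.sqrt(N)
p_means = two_tail_proportion(71, 89, MU, sigma_m)

print(round(p_scores, 4))
print(round(sigma_m, 3), round(p_means, 4))
```

Under these assumptions the unrounded z for the sample means is about 1.90, which gives a proportion near 0.058. Looking up z = 1.89 instead gives the .0588 reported above, so the small gap appears to be a matter of rounding z.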
You sample six scores from a population and obtain the following scores. Y = 115, 120, 109, 120, 118, 106.
Compute the 95% confidence interval of the mean.
N = 6
Now we need to know the t value that cuts off 5% of the curve (2.5% on each tail) given our degrees of freedom (df).
df = N - 1 = 5
Go to the t Tool in Oak Software.
The t Tool screen should look like this:
When you press "Calculate" you should get t = ±2.5705.
We can now calculate the 95% confidence interval:
So the 95% confidence interval is: 108.45 ≤ μ ≤ 120.89.
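The steps above can be sketched in Python as a check on the hand calculation. The t value is simply copied from the t Tool result above rather than computed, since the standard library has no t table; the variable names are my own.

```python
import math
from statistics import mean, stdev

scores = [115, 120, 109, 120, 118, 106]
n = len(scores)
df = n - 1

m = mean(scores)             # sample mean
s = stdev(scores)            # sample standard deviation (divides SS by N - 1)
s_m = s / math.sqrt(n)       # estimated standard error of the mean

T_CRIT = 2.5705              # two-tailed 5% cutoff for df = 5, from the t Tool

lower = m - T_CRIT * s_m
upper = m + T_CRIT * s_m

print(round(m, 2), round(s_m, 2))
print(round(lower, 2), round(upper, 2))
```

Carrying full precision through the calculation gives an upper limit of about 120.88 rather than 120.89. The difference comes from rounding the mean and standard error to two decimal places when working by hand.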
There are two acceptable ways to interpret this:
Story problem. You are testing a theory which predicts that Population 1 should have higher scores than Population 2. That you are specifically predicting which population should have the higher scores makes this a directional hypothesis that should be analyzed with a one-tailed test. We will skip all of the number crunching and just look at which tail you shade (which will influence the p value you get).
Write Ha to reflect the prediction: μ1 > μ2
Write H0 to cover everything else: μ1 ≤ μ2
Mark the approximate location of t on the curve.
Now the question becomes do we shade in the area to the left of 't' or the area to the right of 't'? To answer that we need to look at what Ha predicts. Ha says that μ1 is greater than μ2, if that is true then M1 should be greater than M2. If we look at the formula for 't' we can see that whether 't' is negative or positive is determined by whether M1-M2 is negative or positive. Since Ha predicts that M1 should be greater than M2, then if Ha is true M1-M2 should be a positive number, and thus 't' will be positive as well. On the curve we shade in the area where Ha says the results should fall. Ha says that 't' should be positive, which is to the right on the curve, so we shade in that area. The 'idiot-proof' approach is to look at Ha and treat the '>' as an arrow pointing to the tail to shade. We now know which area to select on the Oak Software tool.
Story Problem: You have a theory which predicts that Population 1 should have lower scores than Population 2, which makes this a one-tailed test.
Ha: μ1 < μ2
Analysis: the two-tailed p value is .08
To find the one-tailed p value from this two-tailed p value you need to look at the sample means to see if they fit the theory's prediction. If the theory is correct then M1 should be less than M2 (see Ha above). Let's say that when we look at the sample means we see that M1=34 and M2=56. That fits the theory's prediction!
Now let's say that when we look at the sample means we see that M1=78 and M2=65. The theory's prediction was wrong!
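The worked answers for these two cases are not reproduced here. As a sketch of the standard conversion, which I assume is what they showed: when the sample means fall in the predicted direction the one-tailed p is half the two-tailed p, and when they fall in the opposite direction it is one minus that half. The function name and its arguments are my own.

```python
def one_tailed_p(two_tailed_p, m1, m2, predicted="less"):
    """Convert a two-tailed p to a one-tailed p.

    predicted="less" corresponds to Ha: mu1 < mu2,
    predicted="greater" corresponds to Ha: mu1 > mu2.
    """
    if predicted not in ("less", "greater"):
        raise ValueError("predicted must be 'less' or 'greater'")
    fits_prediction = m1 < m2 if predicted == "less" else m1 > m2
    half = two_tailed_p / 2
    return half if fits_prediction else 1 - half

p_fits = one_tailed_p(0.08, m1=34, m2=56)    # means in the predicted direction
p_wrong = one_tailed_p(0.08, m1=78, m2=65)   # means in the opposite direction

print(round(p_fits, 2), round(p_wrong, 2))
```

In the first case the result drops below .05, and in the second it is nowhere near significant, which is why you have to look at the sample means before halving the p value.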
We want to know if it is easier or harder to remember words from a list when the words are common English words compared to when the words are uncommon English words. Condition A: Ten students were shown a list of 50 uncommon English words at the rate of one word per minute. They were then asked to write down as many words as they could remember from the list. The number of words recalled by each student was recorded. Condition B: The same students were also shown a list of 50 common English words at the rate of one word per minute, and the number of words recalled was recorded. Whether the word was a common word or an uncommon word is the independent variable, and the number of words recalled from each list is the dependent variable. To control for possible carryover effects the order in which the students did the task was counterbalanced (half of the students did the common words first and the uncommon words second, and the other half of the students did the uncommon words first and the common words second). The data are given below:
To analyze the data we need to compute the "Difference Scores", which is each student's score in Condition A minus their score in Condition B. The difference scores are included in the table below.
The difference scores tell us how different each student's performance was in Condition A (uncommon words) compared to Condition B (common words); thus the difference scores reflect the effect of the independent variable (uncommon vs. common words). That most of the difference scores are negative numbers indicates that scores were usually higher in Condition B than in Condition A. If H0 is true then the independent variable has no effect, the mean of the difference scores in the population from which these 10 students were sampled is zero, and the fact that these 10 students happened to do better in Condition B than in Condition A is just due to random sampling error.
Let's take a look at the mean of the ten difference scores in our sample.
So in our sample the mean difference score was -1.9; if H0 is true we would expect it to be around zero. The next step is to see if our mean difference score of -1.9 is significantly different from zero.
Compute the SS of the difference scores (using the computational formula just like you would for any set of numbers).
Now we march through the three steps for the computation of the standard error.
Now we can compute the value of t for our sample mean.
And the degrees of freedom.
We go to the t distribution tool in Oak Software, input df=9, select a two-tailed test, input t=-2.19, and you will get p=.0563.
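If you do not have the t Tool handy, the two-tailed p value can be approximated directly from the t distribution. The sketch below is my own stand-in for the tool, not how Oak Software works internally: it integrates the t density between 0 and |t| with Simpson's rule and subtracts twice that area from 1.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps                     # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

p = two_tailed_p(-2.19, 9)
print(round(p, 4))
```

The result should land very close to the .0563 reported by the t Tool.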
p > .05. So...
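The whole sequence of steps in this problem can be sketched in a few lines of Python. The data table is not reproduced in this excerpt, so the scores below are made up purely for illustration (four hypothetical students, not the ten in the problem), and I am assuming that the "three steps" for the standard error are the variance, the standard deviation, and then the standard error of the mean.

```python
import math

# Hypothetical scores for four students (NOT the data from this problem).
cond_a = [5, 7, 6, 4]    # words recalled, uncommon words
cond_b = [8, 8, 9, 7]    # words recalled, common words

# Difference score = Condition A minus Condition B for each student.
diffs = [a - b for a, b in zip(cond_a, cond_b)]
n = len(diffs)

mean_d = sum(diffs) / n

# Computational formula for SS: sum(D^2) - (sum(D))^2 / N
ss = sum(d * d for d in diffs) - sum(diffs) ** 2 / n

variance = ss / (n - 1)      # step 1: estimated population variance
sd = math.sqrt(variance)     # step 2: estimated population standard deviation
se = sd / math.sqrt(n)       # step 3: estimated standard error of the mean

t = (mean_d - 0) / se        # H0 says the population mean difference is 0
df = n - 1

print(diffs, mean_d, ss)
print(se, t, df)
```

With real data you would take the resulting t and df to the t Tool, exactly as in the problem above.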
Say that we have an experiment with four groups (i.e. four levels of the independent variable) with 20 people in each group.
'a' = the number of groups = 4
'Nt'= the total number of scores = (4)(20) = 80.
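From these two quantities the degrees of freedom for the summary table follow directly. A minimal sketch using the standard one-way formulas (dfBetween = a - 1, dfWithin = Nt - a, dfTotal = Nt - 1); the variable names are my own:

```python
a = 4                   # number of groups
n_per_group = 20        # people in each group
n_total = a * n_per_group

df_between = a - 1
df_within = n_total - a
df_total = n_total - 1

# The between and within df should always add up to the total df.
assert df_between + df_within == df_total

print(df_between, df_within, df_total)
```

The built-in check that the parts add up to the total is a handy way to catch arithmetic slips when filling out the table by hand.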
The partially completed Summary Table is given below, fill out the rest of the table.
The work is shown below:
To find the value of "p" go to the F Tool in Oak Software and enter the df's and value of F. As F has MSBetween in its numerator and MSWithin in its denominator, enter dfBetween as the 'df Numerator' and dfWithin as the 'df Denominator'.
The completed Summary Table looks like this:
As p > .05:
Let's say we have a factorial design with 4 levels of Independent Variable A and 3 levels of Independent Variable B, and 15 subjects in each treatment combination. This example assumes that you have an up-to-date browser that can handle subscripted text.
"a" is the number of levels of IV A which is 4, so a=4.
"b" is the number of levels of IV B which is 3, so b=3.
The design is shown in the table below:
As we can see, there are 12 treatment combinations (also known as 'cells'). This can also be easily computed. The number of treatment combinations equals 'ab', which can also be written as '(a)(b)', which is simply 'a times b' = (4)(3)=12.
As there are 15 subjects in each treatment combination there must be (12)(15)=180 subjects all together. So, NTotal=180.
The partially completed Summary Table is given below, fill out the rest of the table.
SSTotal = SSA + SSB + SSAB + SSWithin = 45.39 + 13.20 + 107.64 + 873.60 = 1039.83
dfA = a - 1 = 4 - 1 = 3
dfB = b - 1 = 3 - 1 = 2
dfAB = (a - 1)(b - 1) = (4 - 1)(3 - 1) = 6
dfWithin = NTotal - (a)(b) = 180 - (4)(3) = 180 - 12 = 168
dfTotal = NTotal - 1 = 180 - 1 = 179, or, dfA + dfB + dfAB + dfWithin = 3 + 2 + 6 + 168 = 179
MSA = SSA / dfA = 45.39 / 3 = 15.13
MSB = SSB / dfB = 13.20 / 2 = 6.60
MSAB = SSAB / dfAB = 107.64 / 6 = 17.94
MSWithin = SSWithin / dfWithin = 873.60 / 168 = 5.20
FA = MSA / MSWithin = 15.13 / 5.20 = 2.91
FB = MSB / MSWithin = 6.60 / 5.20 = 1.27
FAB = MSAB / MSWithin = 17.94 / 5.20 = 3.45
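The formulas above can be run as a short Python sketch to check the completed table. The SS values and design sizes come from this problem; the dictionary layout and variable names are my own choices.

```python
a, b, n_per_cell = 4, 3, 15
n_total = a * b * n_per_cell

# Sums of squares given in the partially completed Summary Table.
ss = {"A": 45.39, "B": 13.20, "AB": 107.64, "Within": 873.60}

# Degrees of freedom for each source.
df = {
    "A": a - 1,
    "B": b - 1,
    "AB": (a - 1) * (b - 1),
    "Within": n_total - a * b,
}

ss_total = sum(ss.values())
df_total = n_total - 1

# Mean squares, then F ratios (each effect's MS over MSWithin).
ms = {source: ss[source] / df[source] for source in ss}
f = {source: ms[source] / ms["Within"] for source in ("A", "B", "AB")}

for source in ss:
    f_text = f"{f[source]:.2f}" if source in f else ""
    print(f"{source:<7}{ss[source]:>9.2f}{df[source]:>5}{ms[source]:>8.2f}{f_text:>7}")
print(f"{'Total':<7}{ss_total:>9.2f}{df_total:>5}")
```

Each F and its pair of df's (the effect's df as numerator, dfWithin as denominator) is what you would then enter into the F Tool.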
To find the p values for the F's, go to the F Tool in the Oak Software and plug in the appropriate df's and F value. For FA, for example, F = MSA / MSWithin, so the df numerator is dfA and the df denominator is dfWithin.
The table below shows the various means we need to look at to determine whether or not there is an apparent effect due to A, B, and/or AB. We have the means of each cell (e.g. Ma1b1), we have the means of each level of A (e.g. Ma1), and we have the means of each level of B (e.g. Mb1).
Effect Due to A?
To see if there is an effect due to A we compare the mean for a1 (Ma1, which is the mean of all the scores in a1, including cell a1b1 and cell a1b2 and cell a1b3) with the mean for a2 (Ma2), the mean for a3 (Ma3), and the mean for a4 (Ma4). They are all the same, which tells us that IV A had no effect. Note that in real life, if IV A had no effect the sample means would still differ a little due to random sampling error. I just want to get across the point that if the independent variable had no effect then the means should be very similar.
Effect Due to B?
To see if there is an effect due to B we compare the mean for b1 (Mb1) with the mean for b2 (Mb2) and the mean for b3 (Mb3). They are different, which tells us that IV B apparently did have an effect. Note that we would still have to do the F test for B to see if the differences in the B means are statistically significant (i.e. greater than we would expect if just random error were involved). I just want you to be able to look at the means, see they differ a fair amount, and know that this is what the F test for B is looking at.
Effect Due to AB (Interaction of A and B)?
Let us start off by looking at how IV A affects the scores in b1.
We can see that in the b1 groups, as we move from a1 to a2 the mean goes up by 25, and as we move from a2 to a3 the mean goes down by 10, and as we move from a3 to a4 the mean goes up by 5. These differences show us the effect of A in the b1 groups. If there is no interaction then we will find that same pattern in b2 and in b3, for if A and B don't interact then the effect of A will be the same in each level of b. Let's look at b2.
In b2, as we move from a1 to a2 the mean goes down by 10; that is NOT what happened in b1 (where the mean went up by 25 from a1 to a2). That right there indicates that A and B apparently interact, for the effect of going from one level of A to another level of A depends upon the level of B we are talking about. If we keep going we see other similar indications of interaction. In b2 as we move from a2 to a3 the mean goes down by 5 while in b1 the mean goes down by 10...and so on. The effect of A depends upon whether we are talking about b1 or b2, which means that A and B interact. While there is apparently an interaction between A and B we need to perform the F test for AB to see if that interaction is statistically significant (i.e. more than would be expected just due to random sampling error).
The full table of means is given again below.
In this example we will compute the correlation between X and Y and the regression formula for predicting Y based upon X.
We have the following X and Y scores from 6 people.
First let's look at the scatter plot of the data (generated by the SPSS software program).
We can see that there is a negative correlation between the two variables. Now let's calculate the correlation coefficient. First we create a third column that is everyone's X score times their Y. Then we find the sum of each column.
Now we are ready to do our number crunching. We don't need the means for correlation but we will need them for regression and this is a good time to compute them. When working with correlations, and particularly when working with regression, it is good to go three decimal places to the right of the decimal point (if the unrounded answer goes that far).
So the correlation between X and Y is -0.745, which is negative as we expected from looking at the scatter plot.
The coefficient of determination is r squared:
So 55.5% of the variation in the Y scores can be predicted by, or explained by, or is connected to, the X scores. Also, 55.5% of the variation in the X scores can be predicted by, or explained by, or is connected to, the Y scores.
Now let's see if the correlation is statistically significant. It is a strong correlation, which will help, but N is very small, which will hurt. We will do a two-tailed test.
H0: ρ = 0 (note that is the Greek letter 'rho' not 'p')
Ha: ρ ≠ 0
When we go to the t Tool in Oak Software, input df=4, select a two-tailed test, and input t=-2.237 then we find that p=0.089.
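The formula that turns r into t is not shown in this excerpt. The sketch below assumes the standard one, t = r√(N - 2) / √(1 - r²) with df = N - 2, and plugs in the rounded correlation reported above.

```python
import math

r = -0.745     # correlation reported above (rounded to three places)
n = 6          # number of people

df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

print(df, round(t, 3))
```

With r rounded to -0.745 this gives a t of about -2.234. The -2.237 used above presumably comes from carrying more decimal places of r through the calculation; either value leads to the same conclusion.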
p > .05 so...
The generic form of the regression equation is: Y' = a + bX. To find the regression equation for our sample we need to compute 'a' and 'b'.
So our regression equation is: Y' = 16.502 + (-0.524)X.
If, for example, X=10 then our predicted value for Y would be Y' = 16.502 + (-0.524)10 = 16.502 + (-5.24)=11.262.
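The same prediction can be wrapped in a small Python function so that it is easy to try other values of X. The constant and function names are my own; the intercept and slope are the ones computed above.

```python
A = 16.502     # intercept ('a' in Y' = a + bX)
B = -0.524     # slope ('b' in Y' = a + bX)

def predict_y(x):
    """Return the predicted Y (Y') for a given X using Y' = a + bX."""
    return A + B * x

print(round(predict_y(10), 3))
```

Because the slope is negative, larger values of X give smaller predicted values of Y, which matches the negative correlation seen in the scatter plot.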