
Sunday, November 28, 2010

Regression (UNIT -5)

In statistics, regression analysis includes any techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.

History

The earliest form of regression was the method of least squares (French: méthode des moindres carrés), which was published by Legendre in 1805, and by Gauss in 1809. Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun. Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss–Markov theorem.
The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.
Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression, regression involving correlated responses such as time series and growth curves, regression in which the predictor or response variables are curves, images, graphs, or other complex data objects, regression methods accommodating various types of missing data, nonparametric regression, Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor variables than observations, and causal inference with regression.

Regression models

Regression models involve the following variables:
  • The unknown parameters denoted as β; this may be a scalar or a vector.
  • The independent variables, X.
  • The dependent variable, Y.
In various fields of application, different terminologies are used in place of dependent and independent variables.
A regression model relates Y to a function of X and β.
Y \approx f (\mathbf {X}, \boldsymbol{\beta} )
The approximation is usually formalized as E(Y | X) = f(X, β). To carry out regression analysis, the form of the function f must be specified. Sometimes the form of this function is based on knowledge about the relationship between Y and X that does not rely on the data. If no such knowledge is available, a flexible or convenient form for f is chosen.
Assume now that the vector of unknown parameters β is of length k. In order to perform a regression analysis the user must provide information about the dependent variable Y:
  • If N data points of the form (Y,X) are observed, where N < k, most classical approaches to regression analysis cannot be performed: since the system of equations defining the regression model is underdetermined, there is not enough data to recover β.
  • If exactly N = k data points are observed, and the function f is linear, the equations Y = f(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the X are linearly independent. If f is nonlinear, a solution may not exist, or many solutions may exist.
  • The most common situation is where N > k data points are observed. In this case, there is enough information in the data to estimate a unique value for β that best fits the data in some sense, and the regression model when applied to the data can be viewed as an overdetermined system in β.
In the last case, the regression analysis provides the tools for:
  1. Finding a solution for unknown parameters β that will, for example, minimize the distance between the measured and predicted values of the dependent variable Y (also known as method of least squares).
  2. Under certain statistical assumptions, the regression analysis uses the surplus of information to provide statistical information about the unknown parameters β and predicted values of the dependent variable Y.

Necessary number of independent measurements

Consider a regression model which has three unknown parameters, β0, β1, and β2. Suppose an experimenter performs 10 measurements all at exactly the same value of independent variable vector X (which contains the independent variables X1, X2, and X3). In this case, regression analysis fails to give a unique set of estimated values for the three unknown parameters; the experimenter did not provide enough information. The best one can do is to estimate the average value and the standard deviation of the dependent variable Y. Similarly, measuring at two different values of X would give enough data for a regression with two unknowns, but not for three or more unknowns.

If the experimenter had performed measurements at three different values of the independent variable vector X, then regression analysis would provide a unique set of estimates for the three unknown parameters in β.
In the case of general linear regression, the above statement is equivalent to the requirement that the matrix X^\top X is invertible.
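The invertibility requirement can be checked by hand in a small case. The sketch below is a simplification, not the three-parameter model discussed above: it assumes a straight-line model with two unknowns (intercept and slope), so that X^\top X is a 2 x 2 matrix whose determinant is N·Σx² − (Σx)². The function name and the data values are made up for illustration.

```python
def xtx_determinant(xs):
    """Determinant of X^T X for the design matrix whose rows are (1, x)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    # X^T X = [[n, sum_x], [sum_x, sum_x2]], so its determinant is:
    return n * sum_x2 - sum_x ** 2


# Ten measurements, all at the same value of x: no unique estimates.
same_point = [2.0] * 10
# Measurements at two different values of x: enough for two unknowns.
two_points = [1.0, 1.0, 1.0, 3.0, 3.0]

print(xtx_determinant(same_point))  # 0.0, so X^T X is not invertible
print(xtx_determinant(two_points))  # 24.0, so X^T X is invertible
```

A zero determinant means the normal equations have no unique solution, which matches the case of an experimenter who measures repeatedly at a single point.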

Statistical assumptions

When the number of measurements, N, is larger than the number of unknown parameters, k, and the measurement errors εi are normally distributed then the excess of information contained in (N - k) measurements is used to make statistical predictions about the unknown parameters. This excess of information is referred to as the degrees of freedom of the regression. 

Underlying assumptions

Classical assumptions for regression analysis include:
  • The sample is representative of the population for the inference prediction.
  • The error is a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are measured with no error. (Note: If this is not so, modeling may be done instead using errors-in-variables model techniques).
  • The predictors are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others. See Multicollinearity.
  • The errors are uncorrelated, that is, the variance-covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.
  • The variance of the error is constant across observations (homoscedasticity). (Note: If not, weighted least squares or other methods might instead be used).

Linear regression

In linear regression, the model specification is that the dependent variable, y_i, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling n data points there is one independent variable, x_i, and two parameters, β_0 and β_1:
straight line: y_i=\beta_0 +\beta_1 x_i +\varepsilon_i,\quad i=1,\dots,n.\!
In multiple linear regression, there are several independent variables or functions of independent variables. For example, adding a term in x_i^2 to the preceding regression gives:
parabola: y_i=\beta_0 +\beta_1 x_i +\beta_2 x_i^2+\varepsilon_i,\ i=1,\dots,n.\!
This is still linear regression; although the expression on the right hand side is quadratic in the independent variable x_i, it is linear in the parameters β_0, β_1 and β_2.
In both cases, \varepsilon_i is an error term and the subscript i indexes a particular observation. Given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:
 \widehat{y_i} = \widehat{\beta}_0 + \widehat{\beta}_1 x_i.
The residual,  e_i = y_i - \widehat{y}_i , is the difference between the value of the dependent variable predicted by the model,  \widehat{y_i} and the true value of the dependent variable yi. One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSE, also sometimes denoted RSS:
SSE=\sum_{i=1}^N e_i^2. \,
Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, \widehat{\beta}_0, \widehat{\beta}_1.
[Figure: Illustration of linear regression on a data set.]
In the case of simple regression, the formulas for the least squares estimates are
\widehat{\beta_1}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}\text{ and }\hat{\beta_0}=\bar{y}-\widehat{\beta_1}\bar{x}
where \bar{x} is the mean (average) of the x values and \bar{y} is the mean of the y values. See simple linear regression for a derivation of these formulas and a numerical example. Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:
 \hat{\sigma}^2_\varepsilon = \frac{SSE}{N-2}.\,
This is called the mean square error (MSE) of the regression. The standard errors of the parameter estimates are given by
\hat\sigma_{\beta_0}=\hat\sigma_{\varepsilon} \sqrt{\frac{1}{N} + \frac{\bar{x}^2}{\sum(x_i-\bar x)^2}}
\hat\sigma_{\beta_1}=\hat\sigma_{\varepsilon} \sqrt{\frac{1}{\sum(x_i-\bar x)^2}}.
Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.
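The formulas above for the least squares estimates, the error variance, and the standard errors can be tried out on a small data set. The following is a minimal sketch in plain Python; the five (x, y) pairs are invented for illustration and the variable names are my own, not part of the original notes.

```python
import math

# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares estimates of slope and intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals, SSE, and the estimate of the error variance (MSE).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e * e for e in residuals)
sigma2 = sse / (n - 2)

# Standard errors of the two parameter estimates.
se_b0 = math.sqrt(sigma2) * math.sqrt(1.0 / n + x_bar ** 2 / s_xx)
se_b1 = math.sqrt(sigma2) * math.sqrt(1.0 / s_xx)

print("slope:", b1, "intercept:", b0)
print("SSE:", sse, "MSE:", sigma2)
print("standard errors:", se_b0, se_b1)
```

For these made-up numbers the slope comes out at about 1.99 and the intercept at about 0.05, with an SSE of about 0.107.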

General linear model

In the more general multiple regression model, there are p independent variables:
 y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \,
The least squares parameter estimates are obtained from p normal equations. The residual can be written as
e_i=y_i - \hat\beta_0 - \hat\beta_1 x_{1i} - \cdots - \hat\beta_p x_{pi}.
The normal equations are
\sum_{i=1}^n \sum_{k=1}^p X_{ij}X_{ik}\hat \beta_k=\sum_{i=1}^n X_{ij}y_i,\  j=1,\dots,p.\,
Note that for the normal equations depicted above,  y_i = \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i \,
That is, there is no β0. Thus in what follows, \boldsymbol \beta = (\beta_1, \beta_2, \dots, \beta_p).
In matrix notation, the normal equations for k responses (usually k = 1) are written as
\mathbf{_p(X_n^\top X )_p\hat \boldsymbol \beta_k= _pX_n^\top Y_k}.\,
with the generalized inverse (denoted by the superscript -) solution, subscripts showing matrix dimensions:
\mathbf{_p\hat \boldsymbol \beta_k= {}_p(X_n^\top X )_p^-X_n^\top Y_k}.\,
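To make the normal equations concrete, here is a small sketch that builds X^\top X and X^\top Y and solves for the estimates by Gaussian elimination. It assumes a single response, an ordinary inverse rather than a generalized one (so X^\top X must be invertible), and an intercept handled as a column of ones in X, in keeping with the "no β0" convention above. The helper names and the data (points lying exactly on the parabola y = 1 + 2x + 3x²) are illustrative only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; estimates are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def least_squares(X, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
X = [[1.0, x, x * x] for x in xs]  # columns: ones, x, x^2

beta_hat = least_squares(X, ys)
print(beta_hat)  # close to [1.0, 2.0, 3.0]
```

Because the model is linear in the parameters, the quadratic term is just one more column of X; nothing else in the calculation changes.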

 

Nonlinear regression

When the model function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure.
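As an illustration of such an iterative procedure, the sketch below applies the Gauss-Newton method to a one-parameter model, y = exp(b·x). Gauss-Newton is only one of several possible choices and is not named in the notes above; the model, the starting value, and the data (generated with b = 0.5 and no noise) are assumptions made for the example.

```python
import math


def gauss_newton_exponential(xs, ys, b_start, iterations=20):
    """Fit y = exp(b * x) by repeatedly solving a linearized least squares step."""
    b = b_start
    for _ in range(iterations):
        # Residuals of the current fit.
        residuals = [y - math.exp(b * x) for x, y in zip(xs, ys)]
        # Derivative of the model with respect to b at each x.
        jacobian = [x * math.exp(b * x) for x in xs]
        # One-parameter normal equation: (J^T J) step = J^T r.
        step = (sum(j * r for j, r in zip(jacobian, residuals))
                / sum(j * j for j in jacobian))
        b += step
    return b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]  # noise-free data with b = 0.5

b_hat = gauss_newton_exponential(xs, ys, b_start=0.3)
print(b_hat)  # converges towards 0.5
```

Each pass linearizes the model around the current estimate and solves a small linear least squares problem, which is why a reasonable starting value matters.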


 


 


 



 

Wednesday, November 24, 2010

Chi-Square(Unit -4)


Pearson's chi-square is used to assess two types of comparison: tests of goodness of fit and tests of independence. A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical distribution. A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other – for example, whether people from different regions differ in the frequency with which they report that they support a political candidate.
The first step in the chi-square test is to calculate the chi-square statistic. In order to avoid ambiguity, the value of the test statistic is denoted by Χ^2 rather than χ^2 (i.e. uppercase chi instead of lowercase); this also serves as a reminder that the distribution of the test statistic is not exactly that of a chi-square random variable. However, some authors do use the χ^2 notation for the test statistic. An exact test which does not rely on using the approximate χ^2 distribution is Fisher's exact test: this is significantly more accurate in evaluating the significance level of the test, especially with small numbers of observations.
The chi-square statistic is calculated by finding the difference between each observed and theoretical frequency for each possible outcome, squaring them, dividing each by the theoretical frequency, and taking the sum of the results. A second important part of determining the test statistic is to define the degrees of freedom of the test: this is essentially the number of observed frequencies adjusted for the effect of using some of those observations to define the "theoretical frequencies".
The chi-square (χ^2) test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. Does the number of individuals or objects that fall in each category differ significantly from the number you would expect? Is this difference between the expected and observed due to sampling error, or is it a real difference?








Chi-Square Test Requirements

1. Quantitative data.
2. One or more categories.
3. Independent observations.
4. Adequate sample size (at least 10).
5. Simple random sample.
6. Data in frequency form.
7. All observations must be used.
           

Expected Frequencies

When you find the value for chi square, you determine whether the observed frequencies differ significantly from the expected frequencies. You find the expected frequencies for chi square in three ways:

1. You hypothesize that all the frequencies are equal in each category. For example, you might expect that half of the entering freshmen class of 200 at Tech College will be identified as women and half as men. You figure the expected frequency by dividing the number in the sample by the number of categories. In this example, where there are 200 entering freshmen and two categories, male and female, you divide your sample of 200 by 2, the number of categories, to get 100 (expected frequencies) in each category.

2. You determine the expected frequencies on the basis of some prior knowledge. Let's use the Tech College example again, but this time pretend we have prior knowledge of the frequencies of men and women in each category from last year's entering class, when 60% of the freshmen were men and 40% were women. This year you might expect that 60% of the total would be men and 40% would be women. You find the expected frequencies by multiplying the sample size by each of the hypothesized population proportions. If the freshmen total were 200, you would expect 120 to be men (60% x 200) and 80 to be women (40% x 200).

Now let's take a situation, find the expected frequencies, and use the chi-square test to solve the problem.
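Before moving on, the two ways of figuring expected frequencies described above can be sketched in a few lines of Python. The function names are my own; the numbers are the Tech College figures from the text.

```python
def equal_expected(total, n_categories):
    """Method 1: assume every category is equally likely."""
    return [total / n_categories] * n_categories


def expected_from_percentages(total, percentages):
    """Method 2: scale the sample size by prior percentages."""
    return [total * pct / 100 for pct in percentages]


# 200 entering freshmen, two categories.
print(equal_expected(200, 2))                    # [100.0, 100.0]
# Last year's split: 60% men, 40% women.
print(expected_from_percentages(200, [60, 40]))  # [120.0, 80.0]
```

In both cases the expected frequencies add up to the sample size, which is a quick check worth making before computing chi-square.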






Situation

Thai, the manager of a car dealership, did not want to stock cars that were bought less frequently because of their unpopular color. The five colors that she ordered were red, yellow, green, blue, and white. According to Thai, the expected frequencies, or number of customers choosing each color, should follow the percentages of last year. She felt 20% would choose yellow, 30% would choose red, 10% would choose green, 10% would choose blue, and 30% would choose white. She then took a random sample of 150 customers and asked them their color preferences. The results of this poll are shown in Table 1 under the column labeled "observed frequencies."


Table 1 - Color Preference for 150 Customers for Thai's Superior Car Dealership

Category Color   Observed Frequencies   Expected Frequencies
Yellow           35                     30
Red              50                     45
Green            30                     15
Blue             10                     15
White            25                     45


Based on the percentages for last year, we would expect 20% to choose yellow. Figure the expected frequency for yellow by taking 20% of the 150 customers, which gives an expected frequency of 30 people for this category. For the color red we would expect 30% of 150, or 45 people, to fall in this category. Using this method, Thai figured out the expected frequencies 30, 45, 15, 15, and 45.

Obviously, there are discrepancies between the colors preferred by customers in the poll taken by Thai and the colors preferred by the customers who bought their cars last year. Most striking is the difference in the green and white colors. If Thai were to follow the results of her poll, she would stock twice as many green cars as she would if she followed the customer color preference for green based on last year's sales. In the case of white cars, she would stock roughly half as many this year.

What to do? Thai needs to know whether or not the discrepancies between last year's choices (expected frequencies) and this year's preferences on the basis of her poll (observed frequencies) demonstrate a real change in customer color preferences. It could be that the differences are simply a result of the random sample she chanced to select. If so, then the population of customers really has not changed from last year as far as color preferences go. The null hypothesis states that there is no significant difference between the expected and observed frequencies. The alternative hypothesis states they are different. The level of significance (the point at which you can say with 95% confidence that the difference is NOT due to chance alone) is set at .05 (the standard for most science experiments). The chi-square formula used on these data is

X^2 = \sum \frac{(O - E)^2}{E}

where O is the Observed Frequency in each category,
E is the Expected Frequency in the corresponding category,
df is the "degrees of freedom" (n - 1), and
X^2 is Chi Square.




PROCEDURE

We are now ready to use our formula for X2 and find out if there is a significant difference between the observed and expected frequencies for the customers in choosing cars. We will set up a worksheet; then you will follow the directions to form the columns and solve the formula.




2. After calculating the Chi Square value, find the "Degrees of Freedom." (DO NOT SQUARE THE NUMBER YOU GET, NOR FIND THE SQUARE ROOT. THE NUMBER YOU GET FROM COMPLETING THE CALCULATIONS AS ABOVE IS CHI SQUARE.)
Degrees of freedom (df) refers to the number of values that are free to vary after a restriction has been placed on the data. For instance, if you have four numbers with the restriction that their sum has to be 50, then three of these numbers can be anything; they are free to vary, but the fourth number is definitely restricted. For example, the first three numbers could be 15, 20, and 5, adding up to 40; then the fourth number has to be 10 in order that they sum to 50. The degrees of freedom for these values are then three. The degrees of freedom here is defined as N - 1, the number in the group minus one restriction (4 - 1).

3. Find the table value for Chi Square. Begin by finding the df found in step 2 along the left hand side of the table. Run your fingers across the proper row until you reach the predetermined level of significance (.05) at the column heading on the top of the table. The table value for Chi Square in the correct box of 4 df and P=.05 level of significance is 9.49.

4. If the calculated chi-square value for the set of data you are analyzing (26.95) is equal to or greater than the table value (9.49), reject the null hypothesis. There IS a significant difference between the data sets that cannot be due to chance alone. If the number you calculate is LESS than the number you find on the table, then you can probably say that any differences are due to chance alone.

In this situation, the rejection of the null hypothesis means that the differences between the expected frequencies (based upon last year's car sales) and the observed frequencies (based upon this year's poll taken by Thai) are not due to chance. That is, they are not due to chance variation in the sample Thai took; there is a real difference between them. Therefore, in deciding what color autos to stock, it would be to Thai's advantage to pay careful attention to the results of her poll!
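The whole calculation for Thai's data can be reproduced with a short script. This is a sketch using the observed counts and last year's percentages from Table 1, with the table value 9.49 for 4 df at the .05 level taken from the text rather than computed; the variable names are my own.

```python
# Observed counts from Thai's poll of 150 customers (Table 1).
observed = {"yellow": 35, "red": 50, "green": 30, "blue": 10, "white": 25}
# Last year's percentages, used to figure the expected frequencies.
last_year_percent = {"yellow": 20, "red": 30, "green": 10, "blue": 10, "white": 30}

total = sum(observed.values())
expected = {color: total * pct / 100 for color, pct in last_year_percent.items()}

# Chi-square: sum over categories of (O - E)^2 / E.
chi_square = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

# Degrees of freedom: number of categories minus one.
df = len(observed) - 1

table_value = 9.49  # from the chi-square table, 4 df, .05 level (as quoted in the text)
reject_null = chi_square >= table_value

print("expected:", expected)
print("chi-square:", round(chi_square, 2), "df:", df)
print("reject null hypothesis:", reject_null)
```

Carried at full precision the statistic is about 26.94; the 26.95 quoted above is what you get if each of the five terms is rounded to two decimals before adding (0.83 + 0.56 + 15.00 + 1.67 + 8.89). Either way it is far above 9.49, so the conclusion is the same.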










The steps in using the chi-square test may be summarized as follows:

Chi-square

1. Write the observed frequencies in column O.
2. Figure the expected frequencies and write them in column E.
3. Use the formula to find the chi-square value: X^2 = \sum \frac{(O - E)^2}{E}
4. Find the df (N - 1).
5. Find the table value (consult the Chi Square Table).
6. If your chi-square value is equal to or greater than the table value, reject the null hypothesis: differences in your data are not due to chance alone.



Problem

            The approximation to the chi-square distribution breaks down if expected frequencies are too low. It will normally be acceptable so long as no more than 20% of the events have expected frequencies below 5. Where there is only 1 degree of freedom, the approximation is not reliable if expected frequencies are below 10. In this case, a better approximation can be obtained by reducing the absolute value of each difference between observed and expected frequencies by 0.5 before squaring; this is called Yates' correction for continuity.
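Yates' correction is easy to see in a two-category case, where there is 1 degree of freedom. The sketch below is illustrative only: the observed counts (110 and 90 against expected 100 and 100) are invented, and clamping the corrected difference at zero is a guard I have added for the case where the difference is already smaller than 0.5.

```python
def chi_square(observed, expected, yates=False):
    """Chi-square statistic, optionally with Yates' continuity correction."""
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            # Reduce each absolute difference by 0.5 before squaring.
            diff = max(diff - 0.5, 0.0)
        total += diff ** 2 / e
    return total


observed = [110, 90]    # hypothetical counts
expected = [100, 100]

plain = chi_square(observed, expected)
corrected = chi_square(observed, expected, yates=True)
print(plain, corrected)  # the corrected value is the smaller of the two
```

With these numbers the uncorrected statistic is 2.0 and the corrected one is 1.805, so the correction makes the test slightly more conservative.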
In cases where the expected value, E, is found to be small (indicating either a small underlying population probability, or a small number of observations), the normal approximation of the multinomial distribution can fail, and in such cases it is found to be more appropriate to use the G-test, a likelihood ratio-based test statistic. Where the total sample size is small, it is necessary to use an appropriate exact test, typically either the binomial test or (for contingency tables) Fisher's exact test; but note that this test assumes fixed and known marginal totals.



Monday, October 25, 2010

UNIT -4 ( TESTING)

 Symbols 


  • α, the probability of Type I error (rejecting a null hypothesis when it is in fact true)
  • n = sample size
  • n_1 = sample 1 size
  • n_2 = sample 2 size
  • \overline{x} = sample mean
  • μ_0 = hypothesized population mean
  • μ_1 = population 1 mean
  • μ_2 = population 2 mean
  • σ = population standard deviation
  • σ^2 = population variance
  • s = sample standard deviation
  • s^2 = sample variance
  • s_1 = sample 1 standard deviation
  • s_2 = sample 2 standard deviation
  • t = t statistic
  • df = degrees of freedom
  • \overline{d} = sample mean of differences
  • d_0 = hypothesized population mean difference
  • s_d = standard deviation of differences
  • \hat{p} = x/n = sample proportion, unless specified otherwise
  • p_0 = hypothesized population proportion
  • p_1 = proportion 1
  • p_2 = proportion 2
  • d_p = hypothesized difference in proportion
  • min{n_1, n_2} = minimum of n_1 and n_2
  • x_1 = n_1 p_1
  • x_2 = n_2 p_2
  • χ^2 = Chi-squared statistic
  • F = F statistic


Theory of t- Distribution 

According to the central limit theorem, the sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is sufficiently large. Therefore, when we know the standard deviation of the population, we can compute a z-score, and use the normal distribution to evaluate probabilities with the sample mean.
But sample sizes are sometimes small, and often we do not know the standard deviation of the population. When either of these problems occurs, statisticians rely on the distribution of the t statistic (also known as the t score), whose values are given by:
t = [ x - μ ] / [ s / sqrt( n ) ]
where x is the sample mean, μ is the population mean, s is the standard deviation of the sample, and n is the sample size. The distribution of the t statistic is called the t distribution or the Student t distribution.

Degrees of Freedom

There are actually many different t distributions. The particular form of the t distribution is determined by its degrees of freedom. The degrees of freedom refers to the number of independent observations in a set of data.
When estimating a mean score or a proportion from a single sample, the number of independent observations is equal to the sample size minus one. Hence, the distribution of the t statistic from samples of size 8 would be described by a t distribution having 8 - 1 or 7 degrees of freedom. Similarly, a t distribution having 15 degrees of freedom would be used with a sample of size 16.
For other applications, the degrees of freedom may be calculated differently. We will describe those computations as they come up.

Properties of the t Distribution

The t distribution has the following properties:
  • The mean of the distribution is equal to 0 .
  • The variance is equal to v / ( v - 2 ), where v is the degrees of freedom (see last section) and v > 2.
  • The variance is always greater than 1, although it is close to 1 when there are many degrees of freedom. With infinite degrees of freedom, the t distribution is the same as the standard normal distribution.
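The variance formula in the list above can be tabulated to see how quickly it approaches 1. This is a small sketch; the function name and the chosen degrees of freedom are arbitrary.

```python
def t_variance(v):
    """Variance of the t distribution with v degrees of freedom (v > 2)."""
    if v <= 2:
        raise ValueError("the variance formula v / (v - 2) requires v > 2")
    return v / (v - 2)


for v in (3, 5, 10, 30, 100, 1000):
    print(v, t_variance(v))
```

The value starts at 3 for 3 degrees of freedom and drops towards 1 as the degrees of freedom grow, which is the sense in which the t distribution approaches the standard normal.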

When to Use the t Distribution

The t distribution can be used with any statistic having a bell-shaped distribution (i.e., approximately normal). The central limit theorem states that the sampling distribution of a statistic will be normal or nearly normal, if any of the following conditions apply.
  • The population distribution is normal.
  • The sampling distribution is symmetric, unimodal, without outliers, and the sample size is 15 or less.
  • The sampling distribution is moderately skewed, unimodal, without outliers, and the sample size is between 16 and 40.
  • The sample size is greater than 40, without outliers.
The t distribution should not be used with small samples from populations that are not approximately normal.


Probability and the Student t Distribution

When a sample of size n is drawn from a population having a normal (or nearly normal) distribution, the sample mean can be transformed into a t score, using the equation presented at the beginning of this lesson. We repeat that equation below:
t = [ x - μ ] / [ s / sqrt( n ) ]
where x is the sample mean, μ is the population mean, s is the standard deviation of the sample, n is the sample size, and degrees of freedom are equal to n - 1.
The t score produced by this transformation can be associated with a unique cumulative probability. This cumulative probability represents the likelihood of finding a sample mean less than or equal to x, given a random sample of size n.


Notation and t Scores

Statisticians use t_α to represent the t-score that has a cumulative probability of (1 - α). For example, suppose we were interested in the t-score having a cumulative probability of 0.95. In this example, α would be equal to (1 - 0.95) or 0.05. We would refer to the t-score as t_0.05.
Of course, the value of t_0.05 depends on the number of degrees of freedom. For example, with 2 degrees of freedom, t_0.05 is equal to 2.92; but with 20 degrees of freedom, t_0.05 is equal to 1.725.
Note: Because the t distribution is symmetric about a mean of zero, the following is true.
t_α = -t_(1 - α)       and       t_(1 - α) = -t_α
Thus, if t_0.05 = 2.92, then t_0.95 = -2.92.

 Example 

  1. The Acme Chain Company claims that their chains have an average breaking strength of 20,000 pounds, with a standard deviation of 1750 pounds. Suppose a customer tests 14 randomly-selected chains. What is the probability that the average breaking strength in the test will be no more than 19,800 pounds?

    Solution:

    One strategy would be a two-step approach:

    • Compute a t score, assuming that the mean of the sample test is 19,800 pounds.
    • Determine the cumulative probability for that t score.

    We will follow that strategy here. First, we compute the t score:

    t = [ x - μ ] / [ s / sqrt( n ) ]
    t = (19,800 - 20,000) / [ 1750 / sqrt(14) ]
    t = ( -200 ) / [ (1750) / (3.74166) ] = ( -200 ) / (467.707) = -0.4276

    where x is the sample mean, μ is the population mean, s is the standard deviation of the sample, n is the sample size, and t is the t score.

    Now, we can determine the cumulative probability for the t score. We know the following:

    • The t score is equal to -0.4276.
    • The number of degrees of freedom is equal to 13. (In situations like this, the number of degrees of freedom is equal to number of observations minus 1. Hence, the number of degrees of freedom is equal to 14 - 1 or 13.)

    Now, we are ready to use the T Distribution Calculator. Since we have already computed the t score, we select "t score" from the drop-down box. Then, we enter the t score (-0.4276) and the degrees of freedom (13) into the calculator, and hit the Calculate button. The calculator reports that the cumulative probability is 0.338. Therefore, there is a 33.8% chance that the average breaking strength in the test will be no more than 19,800 pounds.

    Note: The strategy that we used required us to first compute a t score, and then use the T Distribution Calculator to find the cumulative probability. An alternative strategy, which does not require us to compute a t score, would be to use the calculator in the "Sample mean" mode. That strategy may be a little bit easier. It is illustrated in the next example.
  2. Let's look again at the problem that we addressed above in Example 1. This time, we will illustrate a different, easier strategy to solve the problem.

    Here, once again, is the problem: The Acme Chain Company claims that their chains have an average breaking strength of 20,000 pounds, with a standard deviation of 1750 pounds. Suppose a customer tests 14 randomly-selected chains. What is the probability that the average breaking strength in the test will be no more than 19,800 pounds?

    Solution:

    We know the following:

    • The population mean is 20,000.
    • The standard deviation is 1750.
    • The sample mean, for which we want to find a cumulative probability, is 19,800.
    • The number of degrees of freedom is 13. (In situations like this, the number of degrees of freedom is equal to number of observations minus 1. Hence, the number of degrees of freedom is equal to 14 - 1 or 13.)

    First, we select "Sample mean" from the dropdown box, in the T Distribution Calculator. Then, we plug our known input (degrees of freedom, sample mean, standard deviation, and population mean) into the T Distribution Calculator and hit the Calculate button. The calculator reports that the cumulative probability is 0.338. Thus, there is a 33.8% probability that the average breaking strength in the test will be no more than 19,800 pounds.

    Note: This is the same answer that we found in Example 1. However, the approach that we followed in this example may be a little bit easier than the approach that we used in the previous example, since this approach does not require us to compute a t score.
  3. The school board administered an IQ test to 25 randomly selected teachers. They found that the average IQ score was 115 with a standard deviation of 11. Assume that the cumulative probability is 0.90. What population mean would have produced this sample result?

    Note: In this situation, a cumulative probability of 0.90 suggests that 90% of the random samples drawn from the teacher population will have an average IQ of 115 or less. This problem asks you to find the true population IQ for which this would be true.

    Solution:

    We know the following:

    • The cumulative probability is 0.90.
    • The standard deviation is 11.
    • The sample mean is 115.
    • The number of degrees of freedom is 24. (In situations like this, the number of degrees of freedom is equal to number of observations minus 1. Hence, the number of degrees of freedom is equal to 25 - 1 or 24.)

    First, we select "Sample mean" from the dropdown box, in the T Distribution Calculator. Then, we plug the known inputs (cumulative probability, standard deviation, sample mean, and degrees of freedom) into the calculator and hit the Calculate button. The calculator reports that the population mean is 112.1.

    Here is what this means. Suppose we randomly sampled every possible combination of 25 teachers. If the true population mean were 112.1, we would expect 90% of our samples to have a sample mean of 115 or less.
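The calculator results in the examples above can be checked without the calculator. The sketch below is not the T Distribution Calculator itself: it assumes a plain numerical approach, integrating the t density with Simpson's rule to get the cumulative probability and using bisection to invert it. The function names and the number of integration steps are my own choices.

```python
import math


def t_pdf(t, df):
    """Density of the Student t distribution."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, intervals=2000):
    """Cumulative probability, using Simpson's rule on [0, |t|] and symmetry."""
    if t == 0:
        return 0.5
    upper = abs(t)
    h = upper / intervals
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def t_quantile(p, df):
    """t score with cumulative probability p, found by bisection."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Examples 1 and 2: 14 chains, claimed mean 20,000, s = 1750, sample mean 19,800.
t_score = (19800 - 20000) / (1750 / math.sqrt(14))
probability = t_cdf(t_score, df=13)
print(round(t_score, 4), round(probability, 3))

# Example 3: 25 teachers, sample mean 115, s = 11, cumulative probability 0.90.
t_90 = t_quantile(0.90, df=24)
population_mean = 115 - t_90 * 11 / math.sqrt(25)
print(round(population_mean, 1))
```

The first line of output reproduces the t score of about -0.4276 and the cumulative probability of about 0.338; the second reproduces the population mean of about 112.1.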

Sunday, October 10, 2010

UNIT -3 (SAMPLING)

INTRODUCTION TO SAMPLING

Sampling is the process of selecting units (e.g., people, organizations) from a population of interest so that by studying the sample we may fairly generalize our results back to the population from which they were chosen. Let's begin by covering some of the key terms in sampling like "population" and "sampling frame." Then, because some types of sampling rely upon quantitative models, we'll talk about some of the statistical terms used in sampling. Finally, we'll discuss the major distinction between probability and nonprobability sampling methods and work through the major types in each.


External Validity

External validity is related to generalizing. That's the major thing you need to keep in mind. Recall that validity refers to the approximate truth of propositions, inferences, or conclusions. So, external validity refers to the approximate truth of conclusions that involve generalizations. Put in more pedestrian terms, external validity is the degree to which the conclusions in your study would hold for other persons in other places and at other times.
In science there are two major approaches to how we provide evidence for a generalization. I'll call the first approach the Sampling Model. In the sampling model, you start by identifying the population you would like to generalize to. Then, you draw a fair sample from that population and conduct your research with the sample. Finally, because the sample is representative of the population, you can automatically generalize your results back to the population. There are several problems with this approach. First, perhaps you don't know at the time of your study who you might ultimately like to generalize to. Second, you may not be easily able to draw a fair or representative sample. Third, it's impossible to sample across all times that you might like to generalize to (like next year).
I'll call the second approach to generalizing the Proximal Similarity Model. 'Proximal' means 'nearby' and 'similarity' means... well, it means 'similarity'. The term proximal similarity was suggested by Donald T. Campbell as an appropriate relabeling of the term external validity (although he was the first to admit that it probably wouldn't catch on!). Under this model, we begin by thinking about different generalizability contexts and developing a theory about which contexts are more like our study and which are less so. For instance, we might imagine several settings that have people who are more similar to the people in our study or people who are less similar. This also holds for times and places. When we place different contexts in terms of their relative similarities, we can call this implicit theoretical ordering a gradient of similarity. Once we have developed this proximal similarity framework, we are able to generalize. How? We conclude that we can generalize the results of our study to other persons, places or times that are more like (that is, more proximally similar to) our study. Notice that here, we can never generalize with certainty -- it is always a question of more or less similar.

Threats to External Validity

A threat to external validity is an explanation of how you might be wrong in making a generalization. For instance, you conclude that the results of your study (which was done in a specific place, with certain types of people, and at a specific time) can be generalized to another context (for instance, another place, with slightly different people, at a slightly later time). There are three major threats to external validity because there are three ways you could be wrong -- people, places or times. Your critics could come along, for example, and argue that the results of your study are due to the unusual type of people who were in the study. Or, they could argue that it might only work because of the unusual place you did the study in (perhaps you did your educational study in a college town with lots of high-achieving educationally-oriented kids). Or, they might suggest that you did your study at a peculiar time. For instance, if you did your smoking cessation study the week after the Surgeon General issued the well-publicized results of the latest smoking and cancer studies, you might get different results than if you had done it the week before.

Improving External Validity

How can we improve external validity? One way, based on the sampling model, suggests that you do a good job of drawing a sample from a population. For instance, you should use random selection, if possible, rather than a nonrandom procedure. And, once selected, you should try to assure that the respondents participate in your study and that you keep your dropout rates low. A second approach would be to use the theory of proximal similarity more effectively. How? Perhaps you could do a better job of describing the ways your contexts and others differ, providing lots of data about the degree of similarity between various groups of people, places, and even times. You might even be able to map out the degree of proximal similarity among various contexts with a methodology like concept mapping. Perhaps the best approach to criticisms of generalizations is simply to show them that they're wrong -- do your study in a variety of places, with different people and at different times.


Sampling Terminology

As with anything else in life you have to learn the language of an area if you're going to ever hope to use it. Here, I want to introduce several different terms for the major groups that are involved in a sampling process and the role that each group plays in the logic of sampling.
The major question that motivates sampling in the first place is: "Who do you want to generalize to?" Or should it be: "To whom do you want to generalize?" In most social research we are interested in more than just the people who directly participate in our study. We would like to be able to talk in general terms and not be confined only to the people who are in our study. Now, there are times when we aren't very concerned about generalizing. Maybe we're just evaluating a program in a local agency and we don't care whether the program would work with other people in other places and at other times. In that case, sampling and generalizing might not be of interest. In other cases, we would really like to be able to generalize almost universally. When psychologists do research, they are often interested in developing theories that would hold for all humans. But in most applied social research, we are interested in generalizing to specific groups. The group you wish to generalize to is often called the population in your study. This is the group you would like to sample from because this is the group you are interested in generalizing to. Let's imagine that you wish to generalize to urban homeless males between the ages of 30 and 50 in the United States. If that is the population of interest, you are likely to have a very hard time developing a reasonable sampling plan. You are probably not going to find an accurate listing of this population, and even if you did, you would almost certainly not be able to mount a national sample across hundreds of urban areas. So we probably should make a distinction between the population you would like to generalize to, and the population that will be accessible to you. We'll call the former the theoretical population and the latter the accessible population. In this example, the accessible population might be homeless males between the ages of 30 and 50 in six selected urban areas across the U.S.
Once you've identified the theoretical and accessible populations, you have to do one more thing before you can actually draw a sample -- you have to get a list of the members of the accessible population. (Or, you have to spell out in detail how you will contact them to assure representativeness). The listing of the accessible population from which you'll draw your sample is called the sampling frame. If you were doing a phone survey and selecting names from the telephone book, the book would be your sampling frame. That wouldn't be a great way to sample because significant subportions of the population either don't have a phone or have moved in or out of the area since the last book was printed. Notice that in this case, you might identify the area code and all three-digit prefixes within that area code and draw a sample simply by randomly dialing numbers (cleverly known as random-digit-dialing). In this case, the sampling frame is not a list per se, but is rather a procedure that you follow as the actual basis for sampling. Finally, you actually draw your sample (using one of the many sampling procedures). The sample is the group of people who you select to be in your study. Notice that I didn't say that the sample was the group of people who are actually in your study. You may not be able to contact or recruit all of the people you actually sample, or some could drop out over the course of the study. The group that actually completes your study is a subsample of the sample -- it doesn't include nonrespondents or dropouts. The problem of nonresponse and its effects on a study will be addressed when discussing "mortality" threats to internal validity.
People often confuse what is meant by random selection with the idea of random assignment. You should make sure that you understand the distinction between random selection and random assignment.
At this point, you should appreciate that sampling is a difficult multi-step process and that there are lots of places you can go wrong. In fact, as we move from each step to the next in identifying a sample, there is the possibility of introducing systematic error or bias. For instance, even if you are able to identify perfectly the population of interest, you may not have access to all of them. And even if you do, you may not have a complete and accurate enumeration or sampling frame from which to select. And, even if you do, you may not draw the sample correctly or accurately. And, even if you do, they may not all come and they may not all stay. Depressed yet? This is a very difficult business indeed. At times like this I'm reminded of what Donald Campbell used to say (I'll paraphrase here): "Cousins to the amoeba, it's amazing that we know anything at all!"

Statistical Terms in Sampling

Let's begin by defining some very simple terms that are relevant here. First, let's look at the results of our sampling efforts. When we sample, the units that we sample -- usually people -- supply us with one or more responses. In this sense, a response is a specific measurement value that a sampling unit supplies. In the figure, the person is responding to a survey instrument and gives a response of '4'. When we look across the responses that we get for our entire sample, we use a statistic. There are a wide variety of statistics we can use -- mean, median, mode, and so on. In this example, we see that the mean or average for the sample is 3.75. But the reason we sample is so that we might get an estimate for the population we sampled from. If we could, we would much prefer to measure the entire population. If you measure the entire population and calculate a value like a mean or average, we don't refer to this as a statistic, we call it a parameter of the population.

The Sampling Distribution

So how do we get from our sample statistic to an estimate of the population parameter? A crucial midway concept you need to understand is the sampling distribution. In order to understand it, you have to be able and willing to do a thought experiment. Imagine that instead of just taking a single sample like we do in a typical study, you took three independent samples of the same population. And furthermore, imagine that for each of your three samples, you collected a single response and computed a single statistic, say, the mean of the response. Even though all three samples came from the same population, you wouldn't expect to get the exact same statistic from each. They would differ slightly just due to the random "luck of the draw" or to the natural fluctuations or vagaries of drawing a sample. But you would expect that all three samples would yield a similar statistical estimate because they were drawn from the same population. Now, for the leap of imagination! Imagine that you did an infinite number of samples from the same population and computed the average for each one. If you plotted them on a histogram or bar graph you should find that most of them converge on the same central value and that you get fewer and fewer samples that have averages farther away up or down from that central value. In other words, the bar graph would be well described by the bell curve shape that is an indication of a "normal" distribution in statistics. The distribution of an infinite number of samples of the same size as the sample in your study is known as the sampling distribution. We don't ever actually construct a sampling distribution. Why not? You're not paying attention! Because to construct it we would have to take an infinite number of samples and at least the last time I checked, on this planet infinite is not a number we know how to reach. So why do we even talk about a sampling distribution? Now that's a good question! 
Because we need to realize that our sample is just one of a potentially infinite number of samples that we could have taken. When we keep the sampling distribution in mind, we realize that while the statistic we got from our sample is probably near the center of the sampling distribution (because most of the samples would be there), we could have gotten one of the extreme samples just by the luck of the draw. If we take the average of the sampling distribution -- the average of the averages of an infinite number of samples -- we would be much closer to the true population average -- the parameter of interest. So the average of the sampling distribution is essentially equivalent to the parameter. But what is the standard deviation of the sampling distribution? (OK, never had statistics? There are any number of places on the web where you can learn about them or even just brush up if you've gotten rusty. This isn't one of them. I'm going to assume that you at least know what a standard deviation is, or that you're capable of finding out relatively quickly.) The standard deviation of the sampling distribution tells us something about how different samples would be distributed. In statistics it is referred to as the standard error, so we can keep it separate in our minds from standard deviations. (Getting confused? Go get a cup of coffee and come back in ten minutes... OK, let's try once more... A standard deviation is the spread of the scores around the average in a single sample. The standard error is the spread of the averages around the average of averages in a sampling distribution. Got it?)
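We can't take an infinite number of samples, but a computer can take a few thousand, which is enough to see the idea. The sketch below is a simulation under stated assumptions, not part of the original discussion: it invents a normally distributed population with mean 3.75 and standard deviation 0.25, draws 2000 samples of 100 responses each, and looks at how the sample averages behave. All names and numbers in it are hypothetical choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: responses with mean 3.75 and standard deviation 0.25.
POP_MEAN, POP_SD = 3.75, 0.25
SAMPLE_SIZE = 100    # size of each sample
NUM_SAMPLES = 2000   # stands in for "an infinite number of samples"

def draw_sample_mean():
    """Draw one sample of SAMPLE_SIZE responses and return its average."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)

# One average per sample: this list approximates the sampling distribution.
sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)   # close to the parameter
standard_error = statistics.stdev(sample_means)  # spread of the averages

print(f"average of the sample averages: {mean_of_means:.3f}")
print(f"standard error (spread of the averages): {standard_error:.4f}")
```

The average of the averages lands very near the population mean, and the spread of the averages is far smaller than the spread of the individual responses (roughly 0.25 divided by the square root of 100).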

Sampling Error

In sampling contexts, the standard error is called sampling error. Sampling error gives us some idea of the precision of our statistical estimate. A low sampling error means that we had relatively less variability or range in the sampling distribution. But here we go again -- we never actually see the sampling distribution! So how do we calculate sampling error? We base our calculation on the standard deviation of our sample. The greater the sample standard deviation, the greater the standard error (and the sampling error). The standard error is also related to the sample size. The greater your sample size, the smaller the standard error. Why? Because the greater the sample size, the closer your sample is to the actual population itself. If you take a sample that consists of the entire population you actually have no sampling error because you don't have a sample, you have the entire population. In that case, the mean you estimate is the parameter.

The 68, 95, 99 Percent Rule

You've probably heard this one before, but it's so important that it's always worth repeating... There is a general rule that applies whenever we have a normal or bell-shaped distribution. Start with the average -- the center of the distribution. If you go up and down (i.e., left and right) one standard unit, you will include approximately 68% of the cases in the distribution (i.e., 68% of the area under the curve). If you go up and down two standard units, you will include approximately 95% of the cases. And if you go plus-and-minus three standard units, you will include about 99% of the cases. Notice that I didn't specify in the previous few sentences whether I was talking about standard deviation units or standard error units. That's because the same rule holds for both types of distributions (i.e., the raw data and sampling distributions). For instance, in the figure, the mean of the distribution is 3.75 and the standard unit is .25 (If this was a distribution of raw data, we would be talking in standard deviation units. If it's a sampling distribution, we'd be talking in standard error units). If we go up and down one standard unit from the mean, we would be going up and down .25 from the mean of 3.75. Within this range -- 3.5 to 4.0 -- we would expect to see approximately 68% of the cases. This section is marked in red on the figure. I leave to you to figure out the other ranges. But what does this all mean you ask? If we are dealing with raw data and we know the mean and standard deviation of a sample, we can predict the intervals within which 68, 95 and 99% of our cases would be expected to fall. We call these intervals the -- guess what -- 68, 95 and 99% confidence intervals.
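For readers who want to check the rule, here is a small sketch using the normal distribution that ships with Python's standard library. It computes the share of cases within one, two and three standard units of the mean, and the matching ranges for the example with mean 3.75 and standard unit .25. The function name share_within is made up for this sketch. The exact shares come out slightly above the rounded 68, 95 and 99 figures.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(units):
    """Share of a normal distribution within +/- `units` standard units of the mean."""
    return standard_normal.cdf(units) - standard_normal.cdf(-units)

mean, unit = 3.75, 0.25  # values from the example in the text

for units in (1, 2, 3):
    low, high = mean - units * unit, mean + units * unit
    print(f"+/- {units} unit(s): {share_within(units):.1%} of cases, "
          f"range {low:.2f} to {high:.2f}")
```

Run as written, this reports about 68.3%, 95.4% and 99.7% for one, two and three standard units, with ranges 3.50 to 4.00, 3.25 to 4.25 and 3.00 to 4.50.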
Now, here's where everything should come together in one great aha! experience if you've been following along. If we had a sampling distribution, we would be able to predict the 68, 95 and 99% confidence intervals for where the population parameter should be! And isn't that why we sampled in the first place? So that we could predict where the population is on that variable? There's only one hitch. We don't actually have the sampling distribution (now this is the third time I've said this in this essay)! But we do have the distribution for the sample itself. And we can from that distribution estimate the standard error (the sampling error) because it is based on the standard deviation and we have that. And, of course, we don't actually know the population parameter value -- we're trying to find that out -- but we can use our best estimate for that -- the sample statistic. Now, if we have the mean of the sampling distribution (or set it to the mean from our sample) and we have an estimate of the standard error (we calculate that from our sample) then we have the two key ingredients that we need for our sampling distribution in order to estimate confidence intervals for the population parameter.
Perhaps an example will help. Let's assume we did a study and drew a single sample from the population. Furthermore, let's assume that the average for the sample was 3.75 and the standard deviation was .25. This is the raw data distribution depicted above. Now, what would the sampling distribution be in this case? Well, we don't actually construct it (because we would need to take an infinite number of samples) but we can estimate it. For starters, we assume that the mean of the sampling distribution is the mean of the sample, which is 3.75. Then, we calculate the standard error. To do this, we use the standard deviation for our sample and the sample size (in this case N=100) and we come up with a standard error of .025 (the standard deviation of .25 divided by the square root of 100). Now we have everything we need to estimate a confidence interval for the population parameter. We would estimate that the probability is 68% that the true parameter value falls between 3.725 and 3.775 (i.e., 3.75 plus and minus .025); that the 95% confidence interval is 3.700 to 3.800; and that we can say with 99% confidence that the population value is between 3.675 and 3.825. The real value (in this fictitious example) was 3.72, which lies just outside the 68% interval but inside the 95% and 99% intervals, so our sample has estimated it reasonably well.
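The numbers in this example can be reproduced with a short sketch. It assumes the usual estimate of the standard error of a mean (the sample standard deviation divided by the square root of the sample size) and uses one, two and three standard errors for the 68, 95 and 99% intervals, as in the rule above. The variable names are invented for illustration.

```python
import math

sample_mean = 3.75
sample_sd = 0.25
n = 100  # sample size

# Estimated standard error of the mean: 0.25 / sqrt(100) = 0.025
standard_error = sample_sd / math.sqrt(n)

# One, two and three standard errors on either side of the sample mean.
intervals = {}
for label, units in (("68%", 1), ("95%", 2), ("99%", 3)):
    low = sample_mean - units * standard_error
    high = sample_mean + units * standard_error
    intervals[label] = (round(low, 3), round(high, 3))
    print(f"{label} confidence interval: {low:.3f} to {high:.3f}")
```

This prints 3.725 to 3.775, 3.700 to 3.800, and 3.675 to 3.825, matching the intervals worked out in the text.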

Probability Sampling

A probability sampling method is any method of sampling that utilizes some form of random selection. In order to have a random selection method, you must set up some process or procedure that assures that the different units in your population have equal probabilities of being chosen. Humans have long practiced various forms of random selection, such as picking a name out of a hat, or choosing the short straw. These days, we tend to use computers as the mechanism for generating random numbers as the basis for random selection.

Some Definitions

Before I can explain the various probability methods we have to define some basic terms. These are:
  • N = the number of cases in the sampling frame
  • n = the number of cases in the sample
  • NCn = the number of combinations (subsets) of n cases from N, that is, N! / (n! (N - n)!)
  • f = n/N = the sampling fraction
That's it. With those terms defined we can begin to define the different probability sampling methods.
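To make these terms concrete, here is a tiny sketch with made-up numbers: a sampling frame of 10 cases from which we draw a sample of 3. The figures are purely hypothetical and chosen small so the results are easy to check by hand.

```python
import math

N = 10  # hypothetical number of cases in the sampling frame
n = 3   # hypothetical number of cases in the sample

# NCn: how many different samples (subsets) of size n could be drawn from N.
num_possible_samples = math.comb(N, n)

# f: the sampling fraction.
f = n / N

print(num_possible_samples)  # 120
print(f)                     # 0.3
```

Even with a frame of only 10 cases there are 120 different samples of size 3, which gives some feel for how quickly the number of possible samples grows in a real study.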


Nonprobability Sampling

The difference between nonprobability and probability sampling is that nonprobability sampling does not involve random selection and probability sampling does. Does that mean that nonprobability samples aren't representative of the population? Not necessarily. But it does mean that nonprobability samples cannot depend upon the rationale of probability theory. At least with a probabilistic sample, we know the odds or probability that we have represented the population well. We are able to estimate confidence intervals for the statistic. With nonprobability samples, we may or may not represent the population well, and it will often be hard for us to know how well we've done so. In general, researchers prefer probabilistic or random sampling methods over nonprobabilistic ones, and consider them to be more accurate and rigorous. However, in applied social research there may be circumstances where it is not feasible, practical or theoretically sensible to do random sampling. Here, we consider a wide range of nonprobabilistic alternatives.


Sampling distribution

The sampling distribution of the mean is a very important distribution.

 The standard deviation of the sampling distribution of the statistic is referred to as the standard error of that quantity. For the case where the statistic is the sample mean, the standard error is:

\sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}
where σ is the standard deviation of the population distribution of that quantity and n is the size of the sample (the number of items in it).



Monday, September 20, 2010

Permutation & Combination

Permutation: Permutation means arrangement of things. The word arrangement is used when the order of things matters.
Combination: Combination means selection of things. The word selection is used when the order of things has no importance.

Example:     Suppose we have to form a three-digit number using the digits 1, 2, 3, 4. To form this number the digits have to be arranged. Different numbers will be formed depending upon the order in which we arrange the digits. This is an example of Permutation.


Now suppose that we have to make a team of 11 players out of 20 players. This is an example of combination, because the order of players in the team will not result in a change in the team. No matter in which order we list the players, the team will remain the same! For a different team to be formed, at least one player will have to be changed.
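Both examples can be counted with Python's standard library. The sketch below assumes that a digit may not be repeated within a number (the example above does not say either way), so the three-digit numbers are arrangements of 3 of the 4 digits. The variable names are made up for illustration.

```python
import itertools
import math

# Permutation: three-digit numbers from the digits 1, 2, 3, 4.
# Assumption: no digit is used twice in the same number.
digits = [1, 2, 3, 4]
three_digit_numbers = [
    100 * a + 10 * b + c
    for a, b, c in itertools.permutations(digits, 3)
]
print(len(three_digit_numbers))  # 24, the same as math.perm(4, 3)

# Order matters: 123 and 321 are different numbers.
print(123 in three_digit_numbers, 321 in three_digit_numbers)  # True True

# Combination: teams of 11 chosen from 20 players. Order does not matter.
num_teams = math.comb(20, 11)
print(num_teams)  # 167960
```

The 24 numbers come from 4 x 3 x 2 ordered choices, while the 167960 teams count each set of 11 players only once, however the players are listed.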