Reliability and Validity of Tests

Test scores must be trustworthy if they are to be used for scientific purposes. To a psychologist this means that they must be both reliable and valid.

Test scores are reliable when they are dependable, reproducible, and consistent. Confusing or tricky tests may mean different things to a test-taker at different times. Tests may be too short to be reliable, or scoring may be too subjective. If a test is inconsistent in its results when measurements are repeated, or when it is scored by two different people, it is unreliable.

A simple analogy is a rubber yardstick. If we did not know how much it stretched each time we took a measurement, the results would be unreliable, no matter how carefully we had made the measurement. We need reliable tests if we are to use the results with confidence.

In order to evaluate reliability, we must secure two independent scores for the same individual on the same test: by treating halves of the test separately, by repeating the test, or by giving it in two different but equivalent forms. If we have such a set of paired scores from a group of individuals, we can determine the test's reliability.

If the same relative score levels are preserved on the two measurements, the test is reliable. Some difference is to be expected, owing to errors of measurement, so an index of the degree of relationship between the two sets of scores is needed. This index is provided by the coefficient of correlation, already familiar to us as a measure of the degree of correspondence between two sets of paired scores. The coefficient of correlation between the two sets of test scores is a reliability coefficient. Well-constructed psychological tests of ability usually have reliability coefficients of r = 0.90 or above.
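
To make the computation concrete, here is a minimal sketch, using invented scores for a repeated testing, of a reliability coefficient as the Pearson correlation between two sets of paired scores. The data and function names are hypothetical, purely for illustration.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for eight individuals on two administrations of the same test.
first_testing  = [82, 90, 75, 68, 95, 70, 88, 79]
second_testing = [80, 92, 78, 65, 96, 72, 85, 81]

r = pearson_r(first_testing, second_testing)
print(f"reliability coefficient r = {r:.2f}")  # high r => relative levels preserved
```

A high coefficient here means that individuals keep roughly the same relative standing on both measurements, which is exactly what "reliable" means in this context.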

Tests are valid when they measure what they are intended to measure. A college examination in economics full of trick questions might be a test of student intelligence rather than of the economics that was to have been learned in the course. Such an examination might be reliable, but it would not be a valid test of achievement for the course.

A test of sense of humour, for example, might be made up of jokes that were hard to catch unless one was both very bright and very well read. Hence it might turn out to be a reliable test of something (intelligence? educational achievement?) but still not be valid as a test of sense of humour.

To measure validity, we must also have two scores for each person: the test score and some measure of what the test is supposed to be measuring. This measure is called a criterion. Suppose that a test is designed to predict success in learning to receive telegraphic code. To determine whether the test is valid, it is given to a group of individuals before they start their study of telegraphy.

After they have been trained to receive coded messages, the students are tested on the number of words per minute they can receive. This later measure furnishes an additional set of scores, which serves as a criterion. Now we can obtain a coefficient of correlation between the early test scores and the scores on the criterion.

This correlation coefficient is known as a validity coefficient, and it tells something about how valuable a given test is for a given purpose. The higher the validity coefficient, the better the prediction that can be made from an aptitude test.
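
The telegraphy example can be sketched the same way. The numbers below are invented: hypothetical aptitude scores taken before training, and a hypothetical criterion (words per minute received) measured afterward.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical pre-training aptitude scores for ten students.
aptitude_scores = [55, 72, 60, 81, 47, 66, 90, 58, 74, 63]
# Hypothetical criterion: words per minute received after training.
words_per_minute = [14, 19, 15, 22, 11, 17, 25, 13, 20, 16]

validity = correlation(aptitude_scores, words_per_minute)
print(f"validity coefficient = {validity:.2f}")
```

The only difference from the reliability computation is what the second set of scores represents: a second administration of the test in the one case, an outside criterion in the other.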

A high validity coefficient is desirable if test scores are to be used to help an individual with an important decision, such as vocational choice. But even a relatively low validity coefficient may prove useful when large numbers of people are tested.

For example, a battery of tests used for the selection of air-crew specialists in the Second World War proved effective in predicting job success, even though some of the validity coefficients for the single tests were of very moderate size. Illustrative validity coefficients from this battery are shown in Table 9.1. Although no single test showed a validity above 0.49, the "composite" score derived from the battery of tests correlated 0.64 with the criterion.
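
The text does not say how the composite was weighted; least-squares regression of the criterion on the single tests is one standard way to form such a composite, and the sketch below uses it with simulated data purely to illustrate why a weighted combination can out-predict any single test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Three hypothetical single tests, each only moderately related to the criterion.
tests = rng.normal(size=(n, 3))
criterion = tests @ np.array([0.4, 0.3, 0.3]) + rng.normal(scale=1.0, size=n)

for i in range(3):
    r = np.corrcoef(tests[:, i], criterion)[0, 1]
    print(f"single-test validity {i + 1}: r = {r:.2f}")

# Regression weights combine the tests into one composite score per candidate.
weights, *_ = np.linalg.lstsq(tests, criterion, rcond=None)
composite = tests @ weights
print(f"composite validity: r = {np.corrcoef(composite, criterion)[0, 1]:.2f}")
```

Each test captures a different slice of the criterion, so the weighted sum correlates more strongly with it than any one test does, which is the pattern the wartime battery showed.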

Test Scores as a Basis for Prediction

With high reliability and validity coefficients we know the test is satisfactory, but the problem of using the test in prediction still remains. The method of prediction most easily understood is the one based on critical scores. By this method, a critical point on the scale of scores is selected. Only those candidates with scores above the critical point are accepted, whether for pilot training, for admission to medical school, or for whatever purpose the testing may serve.

The pilot-selection program of the Air Force illustrates this use of critical scores. The composite scores (called stanines) give each candidate a pilot-prediction rating from 1 to 9. Figure 9.1 shows that those with low stanines failed pilot training much more frequently than those with high stanines. After experience with the tests, the examiners eliminated those with stanines below 5 prior to training.

Thus a stanine of 5 is a critical score. Had this critical score been adopted before training the candidates represented in Fig. 9.1, only 17 percent of those accepted would have failed to complete training. Those dropped would have been the group of low scorers, 54 percent of whom failed elementary pilot training.
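
As a final sketch, here is how such a cutoff can be evaluated after the fact. The records below are invented (stanine, whether the candidate later failed training); the function simply compares failure rates among those at or above the critical score with those below it.

```python
def failure_rates(records, critical_score):
    """records: (stanine, failed) pairs; returns failure rates (accepted, rejected)."""
    accepted = [failed for stanine, failed in records if stanine >= critical_score]
    rejected = [failed for stanine, failed in records if stanine < critical_score]

    def rate(group):
        return sum(group) / len(group) if group else 0.0

    return rate(accepted), rate(rejected)

# Hypothetical (stanine, failed_training) records for ten candidates.
records = [(2, True), (3, True), (4, False), (4, True), (5, False),
           (6, False), (7, False), (7, True), (8, False), (9, False)]

accepted_rate, rejected_rate = failure_rates(records, critical_score=5)
print(f"failure rate among accepted: {accepted_rate:.0%}")
print(f"failure rate among rejected: {rejected_rate:.0%}")
```

A large gap between the two rates, as in the Air Force data, is what justifies adopting the critical score for future selection.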