B.Ed 1st YEAR [ES-333]


ES-333: EDUCATIONAL EVALUATION

Answer the following questions:

i) Establish relationship between Measurement, Assessment and Evaluation with examples.
(250 words)

INTRODUCTION :

Assessment, measurement, evaluation and research are all part of the processes of science, and issues related to each often overlap. Assessment refers to the collection of data to better understand an issue; measurement is the process of quantifying assessment data; evaluation refers to the comparison of those data to a standard for the purpose of judging worth or quality; and research refers to the use of those data for describing, predicting and controlling phenomena as a means toward better understanding them. Measurement is done with respect to "variables" (phenomena that can take on more than one value or level).


RELATIONSHIP BETWEEN MEASUREMENT, ASSESSMENT AND EVALUATION WITH EXAMPLES.

The collecting of data (assessment), quantifying those data (measurement) and developing understanding about the data (research) always raise issues of reliability and validity. Reliability concerns the consistency of the information (data) collected, while validity concerns its accuracy or truth. The relationship between the two can be confusing because measurements (e.g., scores on tests, recorded statements about classroom behavior) can be reliable (consistent) without being valid (accurate or true). The reverse, however, is not true: measurements cannot be valid without being reliable. The same applies to findings from research studies: findings may be reliable (consistent across studies) without being valid (accurate or true statements about relationships among "variables"), but findings cannot be valid if they are not reliable. At a minimum, for a measurement to be reliable it must produce a consistent set of data each time it is used; for a research study to be reliable it should produce consistent results each time it is performed.
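The "reliable but not valid" distinction above can be illustrated numerically. The sketch below uses hypothetical data (not from the source): a mis-calibrated scale whose readings cluster tightly (reliable) but sit 2 kg above the true value (not valid).

```python
import statistics

true_weight = 70.0  # hypothetical true value, in kg

# Readings from a mis-calibrated scale: very consistent,
# but centred about 2 kg above the true weight.
readings = [72.1, 71.9, 72.0, 72.2, 71.8]

spread = statistics.stdev(readings)             # low spread -> reliable
bias = statistics.mean(readings) - true_weight  # large bias -> not valid

print(f"spread (reliability): {spread:.2f} kg")
print(f"bias (validity):      {bias:.2f} kg")
```

The readings would be judged reliable (small spread) yet invalid (systematic bias), mirroring the text: consistency alone does not guarantee accuracy.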

For example, the variable "gender" has the values or levels of male and female and data could be collected relative to this variable. Data on variables are normally collected by one or more of four methods: paper/pencil, systematic observation, participant observation, and clinical.
CLASSROOM ASSESSMENT
Four issues are important for classroom assessment (data collection with regard to student learning that is under the control of the teacher). The first relates to what data teachers will use for making judgments (qualitative or quantitative); the second revolves around when they will collect data (formative vs. summative assessment); the third concerns the reference to be used for making evaluations (criterion- versus norm-referenced); and the fourth relates to how teachers will communicate their judgments to others (authentic assessment, portfolios, and grading).

As the figure shows, tests constitute only a small set of options, among a wide range of others, that a language teacher can use to make decisions about students. The judgment emanating from a test is not necessarily more valid or reliable than one deriving from qualitative procedures, since both must meet reliability and validity criteria to count as informed decisions. The area circumscribed within quantitative decision-making is relatively small and represents a specific choice made by the teacher at a particular time in the course, while the vast area outside it, covering all non-measurement qualitative assessment procedures, represents the wider range of procedures and their general nature. This means that qualitative approaches, which result in descriptions of individuals, as contrasted with quantitative approaches, which result in numbers, can go hand in hand with the teaching and learning experiences in the class, and they can reveal more subtle shades of students' proficiency. This in turn can lead to more illuminating insight about future progress and attainment of goals. The options discussed above are not a matter of either/or (traditional vs. alternative assessment); rather, the language teacher is free to choose the alternative that best suits the particular moment, the particular class and the particular students.

CONCLUSION :

Based on the above discussion, grading could be considered a component of assessment, i.e., a formal, summative, final and product-oriented judgment of the overall quality or worth of a student's performance or achievement in a particular educational activity, e.g., a course. Generally, grading also employs a comparative standard of measurement and sets up a competitive relationship between those receiving the grades. Most proponents of assessment, however, would argue that grading and assessment are two different things, or at least opposite poles on the evaluation spectrum. For them, assessment measures student growth and progress on an individual basis, emphasizing informal, formative, process-oriented reflective feedback and communication between student and teacher. Ultimately, which conception you support probably depends more on your teaching philosophy than anything else.


ii) Explain concept and types of validity with examples. (250 words)

INTRODUCTION :
Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted. Validity isn’t determined by a single statistic, but by a body of research that demonstrates the relationship between the test and the behavior it is intended to measure.
TYPES OF VALIDITY WITH EXAMPLES :
There are three types of validity:

1 Test validity
  • Reliability and validity
  • Construct validity
      - Convergent validity
      - Discriminant validity
  • Content validity
      - Representation validity
      - Face validity
  • Criterion validity
      - Concurrent validity
      - Predictive validity

2 Experimental validity
  • Conclusion validity
  • Internal validity
  • Intentional validity
  • External validity
      - Ecological validity
      - The relationship of external and internal validity

3 Diagnostic validity




1. Test validity

  •  Reliability and validity :  
Validity is often assessed along with reliability - the extent to which a measurement gives consistent results. An early definition of test validity identified it with the degree of correlation between the test and a criterion.

  •  Construct validity
Construct validity refers to the extent to which operationalizations of a construct (e.g. practical tests developed from a theory) do actually measure what the theory says they do. For example, to what extent is an IQ questionnaire actually measuring "intelligence"?

  - Convergent validity : Convergent validity refers to the degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with.
  - Discriminant validity : Discriminant validity describes the degree to which the operationalization does not correlate with other operationalizations that it theoretically should not be correlated with.


  • Content validity
Content validity evidence involves the degree to which the content of the test matches a content domain associated with the construct. For example, a test of the ability to add two numbers should include a range of combinations of digits. A test with only one-digit numbers, or only even numbers, would not have good coverage of the content domain. Content related evidence typically involves subject matter experts (SME's) evaluating test items against the test specifications.
  - Representation validity : Representation validity, also known as translation validity, is about the extent to which an abstract theoretical construct can be turned into a specific practical test.
  - Face validity : Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. Indeed, when a test is subject to faking (malingering), low face validity might make the test more valid.
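The content-domain idea from the addition-test example above can be made concrete. The sketch below (hypothetical item format and helper name) checks whether a set of addition items covers both one- and two-digit operands; full coverage of the intended domain is one piece of content-validity evidence.

```python
# Hypothetical addition-test items as (a, b) operand pairs.
items = [(3, 5), (7, 2), (14, 9), (38, 27), (60, 41)]

def digit_lengths(items):
    """Return the set of operand digit-lengths the items cover."""
    return {len(str(n)) for pair in items for n in pair}

# A test drawn only from one-digit (or only even) numbers would
# cover the content domain poorly; these items span 1- and 2-digit operands.
covered = digit_lengths(items)
print("covers 1- and 2-digit operands:", covered == {1, 2})
```

In practice this kind of coverage check would supplement, not replace, subject-matter experts judging items against the test specifications.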

  • Criterion validity
Criterion validity evidence involves the correlation between the test and a criterion variable (or variables) taken as representative of the construct. In other words, it compares the test with other measures or outcomes (the criteria) already held to be valid. For example, employee selection tests are often validated against measures of job performance (the criterion), and IQ tests are often validated against measures of academic performance (the criterion).

  - Concurrent validity : Concurrent validity refers to the degree to which the operationalization correlates with other measures of the same construct that are measured at the same time.
  - Predictive validity : Predictive validity refers to the degree to which the operationalization can predict (or correlate with) other measures of the same construct that are measured at some time in the future. Again, with the selection test example, this would mean that the tests are administered to applicants, all applicants are hired, their performance is reviewed at a later time, and then their scores on the two measures are correlated.


2 Experimental validity
The validity of the design of experimental research studies is a fundamental part of the scientific method, and a concern of research ethics. Without a valid design, valid scientific conclusions cannot be drawn.

  • Conclusion validity :
One aspect of the validity of a study is statistical conclusion validity - the degree to which conclusions reached about relationships between variables are justified. This involves ensuring adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures.
  • Internal validity :
Eight kinds of confounding variable can interfere with internal validity (i.e. with the attempt to isolate causal relationships):
  1. History, the specific events occurring between the first and second measurements in addition to the experimental variables
  2. Maturation, processes within the participants as a function of the passage of time (not specific to particular events), e.g., growing older, hungrier, more tired, and so on.
  3. Testing, the effects of taking a test upon the scores of a second testing.
  4. Instrumentation, changes in calibration of a measurement tool or changes in the observers or scorers may produce changes in the obtained measurements.
  5. Statistical regression, operating where groups have been selected on the basis of their extreme scores.
  6. Selection, biases resulting from differential selection of respondents for the comparison groups.
  7. Experimental mortality, or differential loss of respondents from the comparison groups.
  8. Selection-maturation interaction, which occurs, for example, in multiple-group quasi-experimental designs.
  • Intentional validity :
To what extent did the chosen constructs and measures adequately assess what the study intended to study?
  • External validity
External validity concerns the extent to which a study's findings can be generalized beyond the particular sample and setting studied. A major factor in this is whether the study sample (e.g. the research participants) is representative of the general population along relevant dimensions. Other factors jeopardizing external validity are:
  1. Reactive or interaction effect of testing, where a pretest might increase the scores on a posttest
  2. Interaction effects of selection biases and the experimental variable.
  3. Reactive effects of experimental arrangements, which would preclude generalization about the effect of the experimental variable upon persons being exposed to it in non-experimental settings
  4. Multiple-treatment interference, where effects of earlier treatments are not erasable.

3 Diagnostic validity

In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context:
  • content validity may refer to symptoms and diagnostic criteria;
  • concurrent validity may be defined by various correlates or markers, and perhaps also treatment response;
  • predictive validity may refer mainly to diagnostic stability over time;
  • discriminant validity may involve delimitation from other disorders.
CONCLUSION :
In some instances where a test measures a trait that is difficult to define, an expert judge may rate each item’s relevance. Because each judge is basing their rating on opinion, two independent judges rate the test separately. Items that are rated as strongly relevant by both judges will be included in the final test.
