Some Ideas for the Conference Paper
Maria Araceli Ruiz-Primo
One of the issues discussed at the conference on scaling up was the need for
evidence about the effectiveness of instructional materials or professional
development programs. It was also noted that assessment should
play an important role in this scaling-up process. I propose two ideas to
consider for the conference summary paper: (1) the need to conduct program
evaluations of both instructional materials and professional development programs; and
(2) the need to adopt a different approach to collecting assessment
information about student learning.
On the Need for Program Evaluation
What the field needs is an understanding of the processes involved in developing
and implementing instructional materials and professional development programs
that have proved effective in achieving their goals.
I argue that the practice of program evaluation is a strategy for learning more
about the design and development of successful instructional materials and
teacher enhancement programs (Ruiz-Primo, 1994).
The reasoning behind this is that a central task of program evaluation is to facilitate
the transfer of knowledge from some programs or sites to others
by explaining the processes that lead to the outcomes (e.g.,
Cronbach, 1982; Ruiz-Primo, 1994). In particular, formative evaluation helps
program developers (of instructional materials or professional development)
understand better how, why, and in which contexts
an instructional material or professional development program succeeds
or fails. It helps to specify which aspects of the program are relatively
more successful than others, and among which groups of participants (e.g.,
Cronbach et al., 1980). Formative evaluation should help to accumulate knowledge
about how effective programs are developed and adapted (Ruiz-Primo, 1994).
Formative evaluation should capture information related to the intrinsic value
of the program, that is, the likelihood that it will achieve its goals, as well
as information related to its potential for dissemination, that is, how generalizable
the program is to other settings (e.g., Weiss, 1972; Ruiz-Primo, 1994).
I think the point that needs to be made is that collecting information
(both qualitative and quantitative) is essential to understand better why something
works and under which conditions. I have proposed an approach to conducting formative
evaluation that could provide information about both the intrinsic value and the
generalizability of programs. The approach characterizes a program (instructional
material or professional development) as a system of interrelated components:
context, goals, materials, delivery/implementation, and outcomes. The program
develops through three stages of maturity: (1) the planned program,
in which an idea is turned into a program for action; (2) the experimental
program, a trial program to see what the program can accomplish; and
(3) the prototype program, a model program that attempts to preview
what will happen when the program is fully operational or scaled up. In this
approach, the formative evaluation process is conceptualized as an iterative process
in which the program's goals are realized through successive approximations.
The characteristics of the iterative process vary according to the development
stage: from program reviews and revisions at the planned-program stage to
program tryouts at different sites at the prototype program stage. (I have
a picture that portrays this process.)
For scaling-up instructional materials or professional development programs,
information collected at the prototype program stage is critical. In this
stage, formative evaluation provides information on the adaptations needed
to increase the probability of success when the program is fully operational.
A central evaluation task is to study how implementation and outcomes vary
from site to site. Since the reproducibility of program results in different
sites depends, in part, on how well the enactment of the program is described,
evaluation findings also focus on identifying how the variations observed
across sites are related to the characteristics of the program material and
how adapting them might narrow these variations.
To promote the adoption and implementation of instructional materials and/or
professional development programs, it is necessary to have information about
the impact the programs have on student performance. For both types
of program, instructional materials and professional development, information
about implementation and outcomes is of great importance
at the prototype stage of development. In the end, what counts is that programs
make a measurable difference in student learning. However, even when
researchers and practitioners seek to document influences
on student learning, they are often unable to find adequate measures of learning.
On the Need to Use Assessments at Different Proximities to the Implemented Curriculum
It has been argued that the statewide assessments students take may not be
directly tied to the curriculum they are studying. This is by design:
statewide and nationwide assessments avoid concentrating on specific
subject matter taught to only a fraction of the students being tested. This
situation creates a tension between the knowledge and competencies students
are able to demonstrate on a particular assessment and those they may have
but which the test does not probe (e.g., Raizen, Baron, Champagne, Haertel,
Mullis, & Oakes, 1989).
To address this tension, Ruiz-Primo, Shavelson, Hamilton, and Klein (in press)
have proposed a multilevel approach to evaluating the impact of education
reform on student achievement that would be sensitive to context and small
"treatment" effects. The approach uses different assessments based on their
proximity to the enacted curriculum. Immediate assessments are artifacts
(students' products such as science notebooks) from the enactment of
the curriculum; close assessments parallel the content and activities
of the unit/curriculum; proximal assessments tap knowledge and skills
relevant to the curriculum, but topics can be different; and distal
assessments reflect state/national standards in a particular knowledge domain
(I have another picture for this piece). They provided evidence that the approach
was suitable. Overall, the results were in the predicted direction:
close assessments were more sensitive to changes in student performance, whereas
proximal assessments did not show as much impact of instruction. These results
were replicated across two FOSS instructional units and, in general, across
classrooms. However, high variation in effect sizes across classrooms
suggested that the effect was not uniform and that students' "opportunity to learn"
varied greatly.
The other point I think is important to make in the paper is that, to demonstrate an effect of
an instructional material on student learning, information on student performance must be
collected using assessments with different characteristics. One
should expect a larger effect when the assessments used are developed based
on the curriculum students studied than when more distal assessments are used.
Still, if an effect is observed only on close assessments, the education
reform efforts are called into question.