Some Ideas for the Conference Paper

Maria Araceli Ruiz-Primo

One of the issues discussed at the conference on scaling up was the need for evidence about the effectiveness of instructional materials and professional development programs. It was also noted that assessment should play an important role in the scaling-up process. I propose two ideas for the conference summary paper: (1) the need to conduct program evaluation of both instructional materials and professional development programs; and (2) the need to bring into play a different approach to collecting assessment information about student learning.

On the Need for Program Evaluation

What the field needs is an understanding of the processes involved in developing and implementing instructional materials and professional development programs that have proved effective in achieving their goals. I argue that program evaluation is a strategy for learning more about the design and development of successful instructional materials and teacher enhancement programs (Ruiz-Primo, 1994).

The reasoning behind this is that a central task of program evaluation is to facilitate the transfer of knowledge from one program or site to others by explaining the processes that lead to the outcomes (e.g., Cronbach, 1982; Ruiz-Primo, 1994). In particular, formative evaluation helps developers of instructional materials or professional development programs to better understand how, why, and in which contexts a program succeeds or fails. It helps specify which aspects of the program are relatively more successful than others, and among which groups of participants (e.g., Cronbach et al., 1980). Formative evaluation should help accumulate knowledge about how effective programs are developed and adapted (Ruiz-Primo, 1994). It should capture information about the intrinsic value of the program (the likelihood of achieving the program's goals), as well as information about its potential for dissemination (how well the program generalizes to other settings) (e.g., Weiss, 1972; Ruiz-Primo, 1994).

The point that needs to be made is that collecting information (qualitative and quantitative) is essential for better understanding why something works and under which conditions. I have proposed an approach to formative evaluation that can provide information about both the intrinsic value and the generalizability of programs. The approach characterizes a program (instructional materials or professional development) as a system of interrelated components (context, goals, materials, delivery/implementation, and outcomes) that develops through three stages of maturity: (1) the planned program, turning an idea into a program for action; (2) the experimental program, a trial program to see what the program can accomplish; and (3) the prototype program, a model program that attempts to preview what will happen when the program is fully operational or scaled up. In this approach, the formative evaluation process is conceptualized as an iterative process in which the program's goals are realized through successive approximations. The characteristics of the iterative process vary with the development stage, from program reviews and revisions at the planned-program stage to program tryouts at different sites at the prototype-program stage. (I have a picture that portrays this process.)

For scaling up instructional materials or professional development programs, the information collected at the prototype-program stage is critical. At this stage, formative evaluation provides information on the adaptations needed to increase the probability of success when the program is fully operational. A central evaluation task is to study how implementation and outcomes vary from site to site. Since the reproducibility of program results at different sites depends, in part, on how well the enactment of the program is described, evaluation also focuses on identifying how the variations observed across sites are related to the characteristics of the program materials and how adapting those materials might narrow the variations.

To promote the adoption and implementation of instructional materials and/or professional development programs, it is necessary to have information about how the programs affect student performance. For both types of program, instructional materials and professional development, information about implementation and outcomes is of great importance at the prototype stage of development. In the end, what counts is that programs demonstrate a measurable difference in student learning. A problem, however, is that even when researchers and practitioners seek to document influences on student learning, they are often unable to find adequate measures of learning.

On the Need to Use Assessments at Different Proximities to the Implemented Curriculum

It has been argued that the statewide assessments students take may not be directly tied to the curriculum they are studying. Indeed, statewide and nationwide assessments avoid, by design, specialized topics taught to only a fraction of the students being tested. This situation sets up a tension between the knowledge and competencies students can demonstrate on a particular assessment and those they may have that the test does not in fact probe (e.g., Raizen, Baron, Champagne, Haertel, Mullis, & Oakes, 1989).

To address this tension, Ruiz-Primo, Shavelson, Hamilton, and Klein (in press) have proposed a multilevel approach to evaluating the impact of education reform on student achievement that is sensitive to context and to small "treatment" effects. The approach uses different assessments according to their proximity to the enacted curriculum: immediate assessments are artifacts (student products such as science notebooks) from the enactment of the curriculum; close assessments parallel the content and activities of the unit or curriculum; proximal assessments tap knowledge and skills relevant to the curriculum, although the topics may differ; and distal assessments reflect state or national standards in a particular knowledge domain. (I have another picture for this piece.) The authors provide evidence that the approach is suitable. Overall, their results were in the predicted direction: close assessments were more sensitive to changes in student performance, whereas proximal assessments did not show as much impact of instruction. These results were replicated across two FOSS instructional units and, in general, across classrooms. However, high between-class variation in effect sizes suggested that the effect was not uniform and that students' "opportunity to learn" varied greatly.
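To make concrete what "effect sizes across classrooms" could mean here, one common choice (my assumption for illustration only; the metric actually used in the study is not described above) is a standardized mean gain computed separately for each classroom j on a given assessment:

d_j = \frac{\bar{X}_{\mathrm{post},j} - \bar{X}_{\mathrm{pre},j}}{s_{\mathrm{pooled},j}}

Under this reading, large differences among the d_j values from classroom to classroom are what would signal a non-uniform effect and unequal opportunity to learn.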

The other point I think is important to make in the paper is that demonstrating the effect of instructional materials on student learning requires collecting information on student performance with assessments of different characteristics. One should expect a larger effect when the assessments are developed from the curriculum students studied than when more distal assessments are used. Still, if an effect is observed only on close assessments, the education reform efforts are in question.