Mathematics Curriculum Materials Analysis Reliability Study
Gerald Kulm and Laura Grier
AAAS Project 2061, May 6, 1998
Project 2061 is developing two products for reviewing and reporting on analyses of mathematics and science curriculum materials: (1) a print and CD-ROM tool, Resources for Science Literacy: Curriculum Materials Evaluation, and (2) a database of reports on middle grades mathematics and science textbooks. For users to have confidence in the analysis procedure and in the published textbook reviews, the analysis must be reliable. That is, the procedure for producing reports must be one in which independent reviewers can come to similar judgments for similar reasons.
The Project 2061 procedure for analyzing mathematics curriculum materials, and the support documents that clarify the procedure, have been revised extensively based on feedback from the numerous educators who have used it. In addition, we have developed a list of indicators and a rating scheme to make the procedure more useful for producing ratings and reports. This study was carried out to determine the effectiveness of the procedure and training materials in producing consistent, valid, and reliable ratings of middle grades mathematics textbooks.
Method
Raters: To help us test the reliability and validity of the revised procedure, we convened twelve of our most capable analysts for a rater reliability study. Each of the raters had been trained to analyze middle grades mathematics materials. Some had been trained by Gerald Kulm, Project 2061, as part of a project directed by Bill Bush at the University of Kentucky. Others had been trained by Kulm as part of the Expert Panel effort of the U.S. Department of Education. This training was based on the Project 2061 procedure. Both of these experiences showed that the most reliable ratings are produced when analysts work in teams of two. In addition, the most valid ratings were produced by a team consisting of a middle grades mathematics teacher and a university mathematics or mathematics education faculty member or supervisor.
The analysts for the study were chosen with these factors in mind. Eight experienced mathematics teachers and four university mathematics education faculty were selected. The analysts received a $1,500 consulting fee as well as expenses for (1) attending the training meeting and (2) submitting a satisfactorily completed report. The names and positions of the analysts are provided in Table 1.
Table 1. Mathematics Curriculum Analysts  

Diane Surati Mathematics Teacher Montpelier, VT 
Bill Kunnecke Mathematics Teacher Calvert City, KY 
Mark Deegan Mathematics Teacher Alexandria, VA 
Michele Crowley Mathematics Education Instructor Northern Kentucky University 
Kathleen Morris Mathematics Teacher Lorton, VA 
Sue P. Reehm Mathematics Education Professor Eastern Kentucky University 
Linda Hackett Mathematics Education Professor American University 
Peg Darcy Mathematics Teacher Louisville, KY 
Marshall Gordon Mathematics Teacher Columbia, MD 
Jan McDowell Mathematics Teacher Louisville, KY 
Alice Mikovch Mathematics Education Professor Western Kentucky University 
Faye Stevens Mathematics Teacher Cadiz, KY 
Training: The raters attended a three-day meeting in Washington, DC to become familiar with the revised procedure and to practice applying the rating criteria. Using a mathematics benchmark and a sixth-grade textbook, the training consisted of the following steps:
- After clarifying the benchmark, the participants identified sightings in the textbook. The sightings were discussed, then used for the remainder of the training session.
- For each Instructional Cluster criterion, analysts were guided in a discussion of the cluster, the criterion, and the rating indicators.
- Working in teams, analysts used the indicators to make a rating of each criterion. The ratings for the six teams were displayed, and the discrepancies were discussed as a way to strengthen understanding of the criteria, indicators, and rating procedure.
Following the meeting, the indicators and rating criteria that were unclear or inconsistent were modified to produce a final set of instructions and a rating form. The instructions, along with a full set of textbook examples illustrating materials rated from low to high on how well they addressed each criterion, were compiled into a notebook. The notebook was available to the analysts as they studied the procedure and when they returned home to do their own analyses and ratings.
Design: Following the training, two sets of middle grades mathematics materials were sent to the analysts: Transition Mathematics and Connected Mathematics. For Connected Mathematics, two units from each of three mathematics strands were selected for rating. Each team rated two mathematics benchmarks, one conceptual and one skill, for each of the two sets of curriculum materials. Each of the three pairs of teams was assigned to one of three mathematics strands: Number, Geometry, and Algebra. The two teams for each strand rated the same benchmarks and the same materials independently. Table 2 summarizes the mathematics strands, materials, analysts, and benchmarks used in the study.
Rating: The analysis and rating were done during March 1998. Team members were encouraged to consult with each other and to ask questions of the director. They were asked not to consult or communicate with members of other teams, especially the team that was analyzing the same set of materials. Teams submitted reports that included (1) the sightings for each indicator, (2) the justifications for the sightings, (3) the rating [Met, Not Met, Unsure] of each indicator, (4) the overall rating of each criterion [High, Medium, Low, None], and (5) a justification of the overall rating of each criterion. In all, 24 criteria across 7 instructional clusters were rated.
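The contents of a team report described above can be sketched as a simple record type. The following Python sketch is purely illustrative; the class and field names are ours, not part of the study's materials:

```python
from dataclasses import dataclass, field

# The two rating scales used in the study.
INDICATOR_RATINGS = ("Met", "Not Met", "Unsure")
CRITERION_RATINGS = ("High", "Medium", "Low", "None")

@dataclass
class IndicatorEntry:
    """One indicator's evidence: sightings, justification, and rating."""
    sightings: list      # citations of where the material addresses the benchmark
    justification: str   # why the sightings support the rating
    rating: str          # one of INDICATOR_RATINGS

@dataclass
class CriterionReport:
    """A team's report for one criterion, e.g. "4.1 Building a Case"."""
    criterion: str
    indicators: list = field(default_factory=list)  # IndicatorEntry items
    overall_rating: str = "None"                    # one of CRITERION_RATINGS
    overall_justification: str = ""
```

A full team report would then consist of one such record for each of the 24 criteria, per benchmark and material analyzed.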
All but two or three analysts completed their ratings within the month; the remainder finished within two more weeks. All of the reports were complete and usable in the study.
Table 2. Design of Reliability Study

Number strand
Analyst teams: Diane Surati, Bill Kunnecke, Mark Deegan, Michele Crowley
Materials: Connected Mathematics: Bits and Pieces I; Connected Mathematics: Comparing and Scaling; Transition Mathematics
Concept benchmark 9A#5 (grades 6-8): The expression a/b can mean different things: a parts of size 1/b each, a divided by b, or a compared to b.
Skill benchmark 12B#2 (grades 6-8): Use, interpret, and compare numbers in several equivalent forms such as integers, fractions, decimals, and percents.

Geometry strand
Analyst teams: Kathleen Morris, Sue Reehm, Linda Hackett, Peg Darcy
Materials: Connected Mathematics: Stretching and Shrinking; Connected Mathematics: Looking for Pythagoras; Transition Mathematics
Concept benchmark 9C#1 (grades 6-8): Some shapes have special properties: Triangular shapes tend to make structures rigid, and round shapes give the least possible boundary for a given amount of interior area. Shapes can match exactly or have the same shape in different sizes.
Skill benchmark 12B#3 (grades 6-8): Calculate the circumferences and areas of rectangles, triangles, and circles, and the volumes of rectangular solids.

Algebra strand
Analyst teams: Marshall Gordon, Jan McDowell, Alice Mikovch, Faye Stevens
Materials: Connected Mathematics: Variables and Patterns; Connected Mathematics: Thinking with Mathematical Models; Transition Mathematics
Concept benchmark 9B#3 (grades 6-8): Graphs can show a variety of relationships between two variables. As one variable increases uniformly, the other may do one of the following: increase or decrease steadily, increase or decrease faster and faster, get closer and closer to some limiting value, reach some intermediate maximum or minimum, alternately increase and decrease indefinitely, increase or decrease in steps, or do something different from any of these.
Skill benchmark 11C#4 (grades 6-8): Symbolic equations can be used to summarize how the quantity of something changes over time or in response to other changes.
Summary of Results
There are 24 criteria across the seven instructional clusters. Overall, six separate ratings, one for each of the six benchmarks, were done for each criterion on each of the two materials, resulting in 288 ratings. The results are summarized in Table 3.
There were 34 disagreements, that is, pairs of team ratings that differed by more than one step on the four-point [High, Medium, Low, None] rating scale. The overall percentage agreement was therefore 88.2 percent. Agreement on the two materials differed considerably. For Transition Mathematics, there were 29 differences out of 144 ratings, a rater agreement of 79.9 percent. For Connected Mathematics, there were 5 differences out of 144 ratings, a rater agreement of 96.5 percent. A closer look at the individual benchmarks shows that only 14 of the 34 rating differences were on concept-related benchmarks. This result is due primarily to the Transition Mathematics data, indicating that in this material, skill-related benchmarks were more difficult to rate.
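The agreement percentages follow directly from the counts of differences. As a check on the arithmetic, the figures can be reproduced in a few lines of Python (illustrative only; the counts are taken from the text above):

```python
# Each "difference" is a pair of team ratings more than one step apart
# on the four-point [High, Medium, Low, None] scale.

CRITERIA = 24            # criteria across the seven instructional clusters
MATERIALS = 2
RATINGS_PER_PAIR = 6     # six separate ratings per criterion per material

total = CRITERIA * MATERIALS * RATINGS_PER_PAIR   # 288 ratings in all

def agreement(differences: int, ratings: int) -> float:
    """Percentage of ratings that did not differ by more than one step."""
    return 100 * (ratings - differences) / ratings

overall = agreement(34, total)     # about 88.2 percent
transition = agreement(29, 144)    # about 79.9 percent
connected = agreement(5, 144)      # about 96.5 percent
```

The same function applies to the per-benchmark rows of Table 3, where each benchmark-material pair comprises 24 rated criteria.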
As shown in Table 3, some criteria appeared more difficult to rate, regardless of the benchmark or type of material. For example, there were difficulties in rating criteria 4.4 Connecting Ideas and 7.1 Teacher Content Learning for both materials. For Transition Mathematics, there were also repeated rating differences on criterion 4.1 Building a Case. Cluster 4 Developing and Using Mathematical Ideas resulted in the greatest number of rater differences for Transition Mathematics.
Table 3. Summary of Rater Agreements and Differences in Ratings on Benchmarks

Transition Mathematics

Benchmark   Rater agreement (%)   Criteria with disagreements > 1
9A#5        83                    4.1, 5.2, 6.2, 7.1
9C#1        96                    4.2
9B#3        75                    4.1, 4.2, 4.4, 4.5, 5.1, 7.1
12B#2       63                    2.1, 4.1, 4.3, 4.5, 5.2, 5.3, 6.2, 7.1, 7.2
12B#3       100                   none
11C#4       63                    1.3, 2.1, 2.4, 3.2, 4.1, 4.3, 4.4, 5.1, 7.1

Connected Mathematics

Benchmark   Rater agreement (%)   Criteria with disagreements > 1
9A#5        100                   none
9C#1        100                   none
9B#3        88                    2.2, 4.4, 7.1
12B#2       100                   none
12B#3       100                   none
11C#4       92                    2.2, 4.4