Guest Blogger: Dr. William L. Heller, Using Data Program Director, Teaching Matters*
Data-savvy investigators never make important decisions based on a single source. When teams following the Using Data process believe they may have found a student learning problem, based on their analysis of standardized testing results, they know to confirm the problem through an examination of student work and other common formative assessments. When they do this, it’s important for them to have a norming process in place to ensure that the data being generated is reliable and useful.
Norming is the process of calibrating the use of a single set of scoring criteria among multiple scorers. If norming is successful, a particular piece of work should receive the same score regardless of who is scoring it. With the advent of the Common Core State Standards Initiative, we may anticipate that curriculum-embedded performance tasks will begin to gain prominence over traditional multiple-choice tests, and it will be even more important for teachers to be aware of how to make the best use of these assessments. Whether or not they are rigorous about norming can make a very big difference.
Many years ago, I was an open-ended response scorer for the New Jersey State High School Proficiency Exam, a test students had to pass in order to graduate. My fellow scorers and I were trained on, and given a qualifying exam for, each question we scored. The exam consisted of twenty sample responses to that question. If we gave nineteen of them the correct score, we were cleared to work on that question. Once we were on the job, responses would appear on a computer screen (with no names, so scoring was blind to gender and ethnicity), and we would type the numerical score on our keypads. Each response was graded by two scorers independently. If the two disagreed, it was bumped up to a supervisor. We were evaluated by volume and by how rarely we were overturned. It was an incredibly efficient and reliable system.
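The workflow described above — two independent blind scores, with disagreements escalated to a supervisor — can be sketched in a few lines. This is only an illustration of the logic, not the actual New Jersey scoring software; the function name and scores are invented:

```python
# A minimal sketch of double-scoring with adjudication: each anonymous
# response gets two independent rater scores, and a supervisor's score
# is used only when the two raters disagree.

def adjudicate(score_a, score_b, supervisor_score=None):
    """Return the final score for one response.

    score_a, score_b -- independent scores from two raters
    supervisor_score -- consulted only if the raters disagree
    """
    if score_a == score_b:
        return score_a
    if supervisor_score is None:
        raise ValueError("raters disagree; supervisor score required")
    return supervisor_score

# Two raters agree: the shared score stands.
print(adjudicate(3, 3))                       # 3

# Raters disagree: the supervisor's score is final.
print(adjudicate(2, 4, supervisor_score=3))   # 3
```

Because scorers were evaluated on how rarely they were overturned, each adjudication also doubled as an ongoing check on every rater's calibration.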
Compare this process to the way the writing sections are currently scored on the New York State English Language Arts (ELA) Exam. Different sections of the state have different norming procedures, which means the state as a whole has none. I’ve talked with many New York City teachers who have scored the exam, and they report that there was very little effort to norm. Different scorers had wildly different standards for interpreting the rubric, and even the same scorer could become more lenient as the days went on. The final scores, then, were as much a function of geography, timing, and luck as of student performance. How can we possibly use this data to reliably identify student learning problems, let alone make high-stakes decisions about school, teacher, or student performance?
Teacher teams have the opportunity to be smarter than this in the way they score their local assessments. Before any rubric-based scoring begins, the teachers involved should meet. They should each score the same piece of student work using a common rubric. They may then compare their scores, and use the comparison to guide a conversation about how the rubric will be used. Three such rounds can be fit comfortably within a common planning period. The goal is for the teachers to align their scoring practices with one another, so that scoring will be consistent and fair.
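After each round, it helps to see the spread of scores at a glance before the conversation begins. Here is one possible way to summarize a round; the teacher names, rubric scale, and scores are all hypothetical, and the summary is just a discussion prompt, not a substitute for the conversation itself:

```python
# A hypothetical norming round: several teachers score the same piece
# of student work on a 1-4 rubric, then look at the spread of scores
# before discussing how each of them read the rubric.

from collections import Counter

def summarize_round(scores):
    """Summarize one norming round.

    scores -- dict mapping each teacher to the score they gave
    Returns the score distribution, the most common (modal) score,
    and the fraction of the team that gave that modal score.
    """
    counts = Counter(scores.values())
    modal_score, n = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "modal_score": modal_score,
        "agreement": n / len(scores),
    }

round_1 = {"Ms. A": 3, "Mr. B": 2, "Ms. C": 3, "Mr. D": 4}
print(summarize_round(round_1))
# Here only half the team gave the modal score of 3 -- a useful
# starting point for discussing how the rubric is being interpreted.
```

If the agreement fraction climbs over the three rounds, that is a concrete sign the team is converging on a shared reading of the rubric.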
Norming can often be dismissed as extra work for an already-busy department. But without it, performance-based assessments will not yield reliable data. It’s good form to norm!
*Teaching Matters is a non-profit organization that partners with educators to ensure that all students can succeed in the digital age. They are an official TERC Using Data partner organization, conducting the Using Data for Meaningful Change institute for New York City schools.
January 10, 2013 at 6:35 pm
What are some good resources to use on how to norm? For example, what statistical analyses can be done on data from different scorers to determine how “off” people were?
January 11, 2013 at 11:04 am
Can you tell me more about the type of work being scored? Is it student work? Writing samples? Open-response mathematics problems? The important part of the work is the conversations that teachers have in comparing their analyses, not the statistical analysis. Do they have the same understanding about what constitutes “mastery” levels of performance? Do they have a common understanding of the learning progressions? What is the learning we’re looking for, and what does it look like? Or is your question relative to a different piece of work?
March 28, 2013 at 10:57 am
You’ll want to use Cohen’s kappa as a measure of agreement. It’s hard to take anything below 0.3 very seriously; 0.4 to 0.5 is pretty good for rubric grading of things that aren’t highly standardized, and anything above that is great.
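For readers who want to try this, Cohen’s kappa compares observed agreement between two raters to the agreement expected by chance. Below is a minimal implementation written from the standard formula; the two score lists are invented for illustration (libraries such as scikit-learn also provide a ready-made `cohen_kappa_score`):

```python
# Cohen's kappa for two raters:
#   kappa = (p_observed - p_expected) / (1 - p_expected)
# where p_observed is the fraction of items the raters scored
# identically and p_expected is the chance agreement implied by
# each rater's marginal score frequencies.

from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two equal-length lists of categorical scores."""
    assert len(rater1) == len(rater2) and rater1, "need matched, nonempty score lists"
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: sum over categories of the product of each
    # rater's marginal proportion for that category.
    p_expected = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    if p_expected == 1:   # degenerate case: all scores identical
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

r1 = [1, 2, 3, 3, 2, 4, 1, 2]   # invented rubric scores, rater 1
r2 = [1, 2, 3, 2, 2, 4, 1, 3]   # invented rubric scores, rater 2
print(round(cohens_kappa(r1, r2), 3))   # 0.652
```

By the rough thresholds in the comment above, a kappa of 0.652 would indicate strong agreement for rubric-graded work.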
June 29, 2014 at 6:52 am
Please let me know if you’re looking for an article author for your weblog. You have some really great articles, and I believe I would be a good asset. If you ever want to take some of the load off, I’d absolutely love to write some articles for your blog in exchange for a link back to mine. Please send me an email if interested.
Thanks!
July 15, 2014 at 1:50 pm
Would it be possible for you to send us a link to your blog? Are you an educator? What is your experience with data literacy at the classroom level?
Mary Anne Mather
Using Data for Meaningful Change Managing Editor