For eight years, I taught part-time in the Electrical Engineering graduate program at Northwestern University. The MSIT program is a professional degree program that focuses on developing students’ technical and business skills. It is very selective, the students are quite good, and it was a pleasure for me to be part of it for so long.
I’ve been thinking about grades a lot lately, so I’ve been data-mining eight years of raw scores to see whether there is anything to learn from them. I’m not sure that I’ll discover any new principle, but I’ll share the results as I go.
Each year my class had anywhere from 20 to 33 students. In four of the eight years I based the final grade on two exams, a midterm and a final, as well as four homeworks and class participation. The final grade was weighted heavily toward the two exams. In 2000 I gave only two homeworks, in 2001 and 2002 I gave three, and in 2007 I assigned no graded homeworks.
Each year the class covered largely the same material. In later years I moved faster and covered more topics in slightly less depth. Midterms and finals were similar year to year, often reusing the same questions, though most years I switched out at least 25% of each exam.
I gave model answers to all of the homeworks. I did not give model answers for the exam questions, but in later years I did give “practice exams” that were fairly accurate previews of the real exams.
Most years, the exams consisted of 2-3 “mechanical” questions and 1-2 “essay” questions. Mechanical questions required solving a particular type of problem, such as IP subnetting and aggregation or determining the DNS servers involved in a query. Essay questions were open-ended, required some creativity and tended to make the student apply concepts and think “beyond the classroom” to analyze issues.
One of the first questions I had was how strongly a student’s midterm exam grade correlated with that student’s final exam grade. In other words, how accurately does a student’s performance on one exam predict their performance on the other? If the correlation is very high, that would indicate that evaluating students with a single exam would be appropriate. If the correlation is low, that argues for using multiple exams to evaluate students.
Correlations per year are below:
- 2000: 0.18
- 2001: 0.77
- 2002: 0.81
- 2003: 0.32
- 2004: 0.61
- 2005: 0.65
- 2006: 0.62
- 2007: 0.72
Aside from 2000 and 2003, the correlations between midterm and final exam scores are quite high. This means that, for a significant number of students, either the midterm or the final exam grade is a good predictor of how that student did on the other exam. However, for a minority of students, midterm and final exam scores diverged considerably.
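As a rough sketch of how per-year figures like the ones above can be produced, here is a small Python function that computes a correlation from two lists of scores. It assumes the Pearson correlation coefficient is the measure in use. The function name `pearson` and the sample scores are made up for illustration and are not data from the class.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five students, one (midterm, final) pair each.
midterm = [78, 85, 92, 64, 71]
final = [74, 88, 95, 70, 65]
r = pearson(midterm, final)
print(round(r, 2))
```

Running this once per year, with that year's midterm and final scores in matching student order, would give one number per year. With only 20 to 33 students in a class, a handful of unusual scores can move the result noticeably.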
There could be any number of reasons for this discrepancy. The student may have had a bad day or not had enough time to study. Or they studied in general but missed a key topic on the exam. Or they knew the material reasonably well but wrote their answer poorly.
Of course there could be a degree of grader error as well, especially on the more subjective essay questions.
But what about 2000 and 2003? I may need to go back and look at the exams to see if I can shed any light on why the correlations are so low. 2000 was my first year teaching in the program and I remember giving a final that was much harder than the midterm. It is possible that the difficulty of the final threw off some of the students. But that doesn’t explain 2003. At this point I don’t have any hard facts as to why we see such small correlations in those years.
If there is anything to keep in mind as I continue this analysis, it is that it seems more “fair” to give at least two major exams, as well as some other graded assignments. Putting all of my evaluation eggs in one basket would have changed the grades of a number of students.
For example, looking at 2002, the year with the highest correlation, if I had not given a midterm and based final letter grades just on the final exam, 15 of 30 students would have had a different final letter grade. However, the magnitude of this change would have been small - none would have changed more than half a letter grade (for example, a B to a B+ or a B+ to an A-).
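The kind of what-if comparison described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the letter-grade cutoffs in `CUTOFFS`, the 50/50 exam weighting, and the sample scores are hypothetical, and the sketch ignores homework and class participation, which were part of the real final grade.

```python
# Hypothetical cutoffs: minimum score needed for each letter grade.
CUTOFFS = [(93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-")]


def letter(score):
    """Map a numeric score to a letter grade using the assumed cutoffs."""
    for minimum, grade in CUTOFFS:
        if score >= minimum:
            return grade
    return "C"


def count_changed(students, midterm_weight=0.5):
    """Count students whose letter grade differs between the two schemes.

    Scheme 1: weighted mix of midterm and final.
    Scheme 2: final exam only.
    """
    changed = 0
    for midterm, final in students:
        combined = midterm_weight * midterm + (1 - midterm_weight) * final
        if letter(combined) != letter(final):
            changed += 1
    return changed


# Hypothetical (midterm, final) scores; not data from the class.
students = [(90, 94), (85, 85), (78, 88), (95, 91)]
print(count_changed(students), "of", len(students), "grades would change")
```

With real scores and the actual grading scheme substituted in, the same loop would also show how large each change is, not just how many there are.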
This calls for further analysis. Perhaps a job for tomorrow.