Multiple Choice Proofs

Testing in the US is dominated by multiple-choice questions. Together with the time limit, this encourages students to stop thinking and go for guessing. I recently wrote an essay, “AMC, AIME, USAMO Contradiction”, in which I complained about the lack of proofs in the first two rounds of math competitions.

Is there a way to improve the situation? I grew up in the USSR, where each round of the math competition had the same format: you were given several hours to write proofs for three or four difficult problems. There are two concerns with organizing a competition in this way. First, the Russian system is much more expensive, whereas the US’s multiple-choice tests can be inexpensively checked by a computer. Second, the Russian system is prone to unfairness. You need many math teachers to check all these papers at the highest level. Some of these teachers might not be fully qualified, and it is difficult to ensure uniform checking. This system can’t easily be adopted in the US. I am surprised I haven’t heard of lawsuits challenging USAMO results, but if we were to start having proofs at the AMC level with several hundred thousand participants, we would get into lots of trouble.

An interesting compromise was introduced at the Streamline Olympiad. The problems were multiple choice, but students were also asked to write proofs. Students got two points for a correct multiple-choice answer, and the proof was checked only if the choice was correct. Students could get up to three points for a correct proof. This idea solves two issues. The writing of proofs is rewarded at an early stage, and the work of the judges is not as overwhelming as it would have been had they needed to check every proof. However, there is one problem that I discussed in previous posts that this method doesn’t solve: with multiple choice, minor mistakes cost you the whole problem, even though you might have been very close to a solution. If we want to reward thinking more than accuracy, the proof system allows us to give credit for partial solutions.
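To make the scoring rule concrete, here is a minimal sketch of it in Python. The function name `streamline_score` is hypothetical, and I am assuming that the proof points are added on top of the two points for the choice, so that a problem is worth at most five points.

```python
def streamline_score(choice_correct: bool, proof_points: int = 0) -> int:
    """Score one problem under the rule described above.

    Assumption: the 0-3 proof points are added to the 2 points
    earned for a correct multiple-choice answer.
    """
    if not choice_correct:
        # A wrong choice means the proof is never read, so it earns nothing.
        return 0
    if not 0 <= proof_points <= 3:
        raise ValueError("proof_points must be between 0 and 3")
    return 2 + proof_points
```

The early return for a wrong choice is exactly where the drawback shows up: a nearly complete proof with a slip at the end scores the same as a blank page.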

I can suggest another approach. If the Russians require proofs for all problems and the Americans don’t require proofs for any problem, why not compromise and require a proof for one problem out of the set?

But I actually have a bigger idea in mind. I think that current developments in artificial intelligence may soon help us to check proofs with the aid of a computer. Artificial intelligence is still far from ready to validate that a mathematical text a human has produced constitutes a proof. But in this particular case, we have two things working for us. First, we can use humans and computers together. Second, we do not need to check the validity of any random proof; we need to check the validity of a specific proof of a simple problem that we know in advance, thus allowing us to prepare the computers.

Let us assume that we already can convert student handwriting into computer-legible text or that students write directly in LaTeX.

Here is the plan. Suppose that for every problem, we create a database of sample right, wrong and partial solutions with corresponding scores. The computer checks each student’s solution against these samples. Hopefully, the computer can recognize small typos and deviations that shouldn’t change the point value. If the computer encounters a solution that is significantly different from the ones in the database, it sends the solution to human judges. Humans decide how to score the solution, and the solution and its score are added to the sample database.
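The loop in this plan can be sketched in a few lines of Python. This is only an illustration under strong assumptions: plain character similarity from the standard library's `difflib` stands in for whatever real proof comparison would be needed, and the names `grade` and `human_judge` and the threshold of 0.9 are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Ignore case and spacing, which should not change the point value.
    return " ".join(text.lower().split())


def similarity(a, b):
    # Crude stand-in for a real comparison of two proofs (an assumption).
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def grade(solution, samples, human_judge, threshold=0.9):
    """Score a solution from the sample database, or ask a human.

    samples is a list of (solution_text, score) pairs; human_judge is a
    callable that returns a score. Both are hypothetical interfaces.
    """
    best_score, best_sim = None, 0.0
    for sample_text, sample_score in samples:
        sim = similarity(solution, sample_text)
        if sim > best_sim:
            best_sim, best_score = sim, sample_score
    if best_sim >= threshold:
        return best_score
    # Too different from everything we have seen: send it to a judge
    # and remember the verdict for the next similar paper.
    score = human_judge(solution)
    samples.append((solution, score))
    return score
```

Because every human verdict is appended to `samples`, the second student to write essentially the same argument is scored without any human involvement.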

For this system to work, computers should be smart enough not to send too many solutions to humans. So how many is too many? My estimate is based on the idea that we wouldn’t want the budget of AMC to go too much higher than the USAMO budget. Since USAMO has 500 participants, judges check just a few hundred solutions to any particular problem. With several hundred thousand participants in AMC, the computer would have to be able to cluster all the solutions into no more than a few hundred groups. The judges would then only have to check one solution in each group.
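Here is a rough sketch of what such grouping could look like, so that the number of groups, and hence the judges' workload, can be counted. It uses a simple greedy rule with word-overlap (Jaccard) similarity; both the rule and the threshold of 0.8 are my assumptions for illustration, not a claim about what would work on real proofs.

```python
def tokens(text):
    # Represent a solution by its set of lowercase words.
    return frozenset(text.lower().split())


def jaccard(a, b):
    # Share of words the two solutions have in common.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_clusters(solutions, threshold=0.8):
    """Put each solution in the first group whose representative is close.

    The first member of a group serves as its representative; a judge
    would only need to read that one paper.
    """
    representatives = []  # token set of each group's first member
    clusters = []         # the solutions in each group
    for solution in solutions:
        t = tokens(solution)
        for i, rep in enumerate(representatives):
            if jaccard(t, rep) >= threshold:
                clusters[i].append(solution)
                break
        else:
            # No existing group is close enough, so start a new one.
            representatives.append(t)
            clusters.append([solution])
    return clusters
```

The number to watch is `len(greedy_clusters(all_solutions))`. If, say, 300,000 papers collapse into 300 groups, the judges read one paper in a thousand, which is the kind of ratio the estimate above calls for.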

As a bonus, we can create a system where for a given solution that is not in the database, the computer finds the closest solution and highlights the difference, thus simplifying the human’s job.
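A toy version of this bonus feature is already possible with Python's standard `difflib` module. The sketch below finds the closest stored solution and lists the words that differ; the function names are hypothetical, and word-level text comparison is only a stand-in for a real comparison of mathematical arguments.

```python
import difflib


def closest_sample(solution, samples):
    # Pick the stored solution that is most similar as plain text.
    # Raises ValueError if samples is empty.
    return max(
        samples,
        key=lambda s: difflib.SequenceMatcher(None, s, solution).ratio(),
    )


def highlight_difference(solution, sample):
    # Compare word by word; keep only the words that were removed ("- ")
    # from the sample or added ("+ ") by the student.
    diff = difflib.ndiff(sample.split(), solution.split())
    return [d for d in diff if d.startswith(("- ", "+ "))]
```

A judge shown only the changed words next to an already scored sample has much less to read than a judge starting from a blank slate.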

In order to improve math education, we need to add proofs when teaching math. My idea might also work for SATs and for other tests.

Now that there is more money available for education research, would anyone like to explore this?



  1. Maria Roginskaya:

    It is not true that all rounds of the MO in the USSR were the same. The St. Petersburg city (region) level competition was oral (normally held over two days, as about 100 students would be invited in each of 6 age groups). Also, in later years, the “profy” in grades 8-10 (i.e. students who were in special mathematical programs) would have a separate section, so that students who hadn’t been taught advanced mathematics would still have some chance at a good result (as both sections were held on the same day, students from the advanced program were not tempted to go to the non-advanced section to win against less prepared contestants, which would have been frowned upon anyway).

    Another, less obvious, feature of St. Petersburg (both region and district) was that any problem was graded only as “+” or “-” (one could mark “+-” or “-+”, but it would still be counted as “+” or “-” in the end). This meant that the competition was less differentiating, so to choose the team for the All-Russian round there was a separate competition, “Otbor” (roughly, “Sieve”), held after the “city” level, to which the top ~20 students from grades 8, 9 and 10 were invited (and sometimes also a couple of top 7th graders and those who had done very well on it the previous year but for some reason missed the “city”).

    On the other hand, there were fewer questions about the fairness of marking. For the district (written) round there was a “conflict” day, when a student could come and try to convince a jury member that what s/he had written was a proof. As the written tests were graded first in the districts and then briefly reviewed by the city-level jury, no more than 1-2 mistakes were found per year (out of about 7000 papers). As far as I recall, the “revision” of one grade (~1200 papers) would take a “long” day (9 a.m. to midnight) for 2-3 experienced jury members (though we were all young and quick at that time :-)).

    This system seems to be the most expensive, though (unless one counts the volunteering teachers and the math grad and undergrad jury members as costing nothing: we had about 30 volunteers for a day of oral competition, and 5-10 high school teachers in major districts).

  2. Felipe Pait:

    I would not frame multiple choice versus written exams as a question of which is more expensive. The main point is that you have a certain amount of a teaching resource. Do you want teachers to spend time with students, or grading exams? That said, computer-aided grading can be very useful in making the chore of grading more efficient, but I doubt artificial intelligence is anywhere near what would be helpful.

    I have seniors and grad students grade themselves, and each other. Spot checking, an interview, and an open eye are mostly enough to prevent gross errors of evaluation or outright cheating. For mass testing of elementary and high school students, a two-phase test with a combination of different types of question seems to me the best way for the time being.
