Suppose that an artificial intelligence algorithm distinguishes between female and male faces with 90 percent accuracy. Sounds impressive, right? But now suppose that the same algorithm is wrong 34.5 percent of the time for darker-skinned female faces, while erring on only 0.8 percent of lighter-skinned male faces. That’s a big problem, but right now AI research papers typically report only aggregate results, without the granular detail that would allow other researchers to spot issues like these.

SFI Professor Melanie Mitchell coauthored a paper, published in Science on April 13, pointing out this problem and proposing solutions.

The problem of aggregation is made worse in the case of models like ChatGPT, because such systems don’t have a single, clearly defined goal. Benchmarks like “Beyond the Imitation Game” have been developed for such models, combining more than 200 tasks. A single overall score on that benchmark tells researchers little about the strengths or weaknesses of a given model. Furthermore, the culture of AI centers on outdoing the current state-of-the-art performance rather than carefully understanding existing models.

Mitchell and her colleagues propose two primary solutions. The first is that scientific journals should require far more granular analyses of the performance of AI models, revealing how well they do on all relevant subgroups. This is essential for understanding a model’s behavior. For example, one computer vision system distinguished between objects like ships and horses with high precision, but analysis showed that it knew nothing about ships or horses. Instead, it was recognizing features of the surrounding background or watermarks naming the image’s source, features that wouldn’t help in the real world.
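
To make the gap between an aggregate score and a subgroup breakdown concrete, here is a minimal Python sketch built on the face-classification example. The group sizes and the two middle error rates are invented for illustration; only the 34.5 percent and 0.8 percent figures come from the article, and the function names are hypothetical rather than anything from the paper.

```python
# Hypothetical evaluation counts per subgroup: (faces tested, faces misclassified).
# Group sizes and the two middle error rates are invented for illustration;
# only the 34.5% and 0.8% error rates come from the example in the article.
counts = {
    "darker female": (200, 69),    # 34.5% error
    "darker male": (300, 48),      # 16.0% error (assumed)
    "lighter female": (500, 75),   # 15.0% error (assumed)
    "lighter male": (1000, 8),     # 0.8% error
}


def aggregate_accuracy(counts):
    """Single headline number: share of all test faces classified correctly."""
    tested = sum(n for n, _ in counts.values())
    wrong = sum(errors for _, errors in counts.values())
    return 1 - wrong / tested


def error_by_group(counts):
    """Granular view: error rate computed separately for each subgroup."""
    return {group: errors / n for group, (n, errors) in counts.items()}


overall = aggregate_accuracy(counts)
per_group = error_by_group(counts)

print(f"aggregate accuracy: {overall:.1%}")
for group, err in per_group.items():
    print(f"{group:>15}: {err:.1%} error")
```

With these made-up counts, the headline figure looks strong because the largest group is also the one the model handles best, while the smallest group carries the highest error rate.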

The second recommendation is that data should be released showing the model’s results on every instance it’s tested on, so that outside researchers can do further analyses.
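
As a rough illustration of what instance-level release could look like, the Python sketch below writes one row per test item to CSV and then shows how an outside researcher might regroup the same rows in ways the original authors did not report. The column names, the six records, and the helper functions are all hypothetical; the paper does not prescribe this format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical per-instance results: one row per test item.
# Column names and values are invented for illustration.
records = [
    {"instance_id": "img-001", "skin_type": "darker", "gender": "female", "label": "female", "prediction": "male"},
    {"instance_id": "img-002", "skin_type": "darker", "gender": "female", "label": "female", "prediction": "female"},
    {"instance_id": "img-003", "skin_type": "darker", "gender": "male", "label": "male", "prediction": "male"},
    {"instance_id": "img-004", "skin_type": "lighter", "gender": "female", "label": "female", "prediction": "female"},
    {"instance_id": "img-005", "skin_type": "lighter", "gender": "male", "label": "male", "prediction": "male"},
    {"instance_id": "img-006", "skin_type": "lighter", "gender": "male", "label": "male", "prediction": "male"},
]


def write_results(rows, handle):
    """Release step: dump every test instance, not just a summary score."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def accuracy_by(rows, *keys):
    """Reanalysis step: accuracy for each combination of the chosen columns."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        hits[group] += row["label"] == row["prediction"]
    return {group: hits[group] / totals[group] for group in totals}


# An in-memory buffer stands in for a published results file.
buffer = io.StringIO()
write_results(records, buffer)
buffer.seek(0)
released = list(csv.DictReader(buffer))

by_skin = accuracy_by(released, "skin_type")
by_both = accuracy_by(released, "skin_type", "gender")
print(by_skin)
print(by_both)
```

Because every row survives the release, the grouping columns passed to `accuracy_by` can be chosen after the fact, which is the kind of further analysis a single aggregate score rules out.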

Mitchell acknowledges that this is just a start. Because so much AI development is happening in industry rather than academia, changing publication practices alone can’t do all that’s needed.

“There’s a lot of discussion of whether AI systems should go through regulatory approval like we have for medical products, where the FDA requires that certain tests or studies be done,” Mitchell says. “Perhaps that’s the next step for machine learning products being deployed in the world.”

Read the paper, "Rethink reporting of evaluation results in AI," in Science (April 14, 2023). DOI: 10.1126/science.adf6369


NSF Grant Award No. 2020103 "AI Institute: Planning: Foundations of Intelligence in Natural and Artificial Systems"