Machine learning systems need informed (i.e., human) oversight
UPDATED: September 29, 2021
Don’t be surprised if your family doctor pulls out her cell phone to diagnose that unusual bump on your skin. She’s using an AI-assisted system, based on machine learning (ML) technology, to determine if the bump is potentially troublesome and needs a specialist’s eye. But what if the software itself is biased because the data it was trained on was skewed toward one race or another?
In new research, Professors Mike H.M. Teodorescu and Gerald C. Kane of the Boston College Carroll School of Management, along with Lily Morse of West Virginia University and Yazeed Awwad of MIT, propose that some measure of human oversight is necessary for any ML-reliant decision-making process. They refer to such oversight as “augmentation.”
In machine learning, networks of algorithms “learn” by looking for patterns in immense amounts of data. But if the training data is skewed—in the dermatology system, for instance, toward whites rather than Blacks, Latinos, or Asians—the output could be faulty. The researchers offer a matrix to guide organizations in instituting appropriate human augmentation to ensure that the ML systems they rely on are as fair and unbiased as possible.
Their article appeared in MIS Quarterly as part of a special issue on “Managing AI.” In it, the authors propose that organizations add humans to the equation according to two factors: where the final decision is made (by a human or by the machine) and the complexity of the fairness criteria underpinning the ML system. Different scenarios dictate at what stage and to what extent human oversight or intervention is advised.
In the case of, say, fintech companies that use ML models to verify identities for validating loan worthiness, only what the authors call “reactive oversight” is needed—that is, where the ML model spits out an outcome that is examined by the human decision maker. If the model doesn’t routinely produce results that, post facto, are deemed in compliance with bias regulations, managers should then provide clearer fairness objectives to their development teams.
When fairness is more complicated and ML systems still make the final decision—for instance, in content filtering for online social media platforms—more human augmentation is called for. That is because the filters themselves could reflect their developers’ opinions, creating “ideological echo chambers.” Such models could even be manipulated by savvy social media users who flood the platform with their own views. In such a case, the authors suggest “proactive oversight” at an earlier stage: humans should manually vet feedback and guide the models “toward the pathways that come closest to meeting agreed-upon standards of right and wrong.”
In cases where final decisions rest with humans and fairness criteria are relatively simple, such as with the AI-assisted dermatology exam, the ML model provides “decision support.” Educating the human decision-makers about how the software may be systematically unfair is sufficient in this situation.
Sometimes, however, it is possible to recognize that an ML system might be skewing outcomes but impossible to determine exactly how. An example is a platform that tracks job candidates’ behavior in video interviews to assess their employability. For all the hiring client knows, the system could simply be basing its calculation on past decisions: if a client company’s workforce is mostly male, the model is more likely to recommend men. In these cases, either managers or even other automated systems should double-check the system’s recommendations.
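For readers who think in code, the four scenarios above can be read as a two-by-two lookup table. The sketch below is only an illustration of that reading, not the authors' own formulation: the names `Decider`, `FairnessComplexity`, `OVERSIGHT_MATRIX`, and `recommended_oversight` are hypothetical, the descriptions paraphrase this article, and the pairing of the fintech scenario with machine-made decisions is inferred from the wording above rather than stated outright.

```python
from enum import Enum


class Decider(Enum):
    """Who makes the final decision (hypothetical names)."""
    MACHINE = "machine"
    HUMAN = "human"


class FairnessComplexity(Enum):
    """How complicated the fairness criteria are (hypothetical names)."""
    SIMPLE = "simple"
    COMPLEX = "complex"


# Illustrative encoding of the four scenarios described in the article.
# The text after each colon is a paraphrase, not the paper's wording.
OVERSIGHT_MATRIX = {
    (Decider.MACHINE, FairnessComplexity.SIMPLE):
        "reactive oversight: a human examines the model's outcomes after the fact",
    (Decider.MACHINE, FairnessComplexity.COMPLEX):
        "proactive oversight: humans vet feedback and guide the model earlier on",
    (Decider.HUMAN, FairnessComplexity.SIMPLE):
        "decision support: educate decision-makers about possible systematic unfairness",
    (Decider.HUMAN, FairnessComplexity.COMPLEX):
        "double-check: managers or other automated systems review recommendations",
}


def recommended_oversight(decider: Decider, complexity: FairnessComplexity) -> str:
    """Look up the kind of human augmentation suggested for a scenario."""
    return OVERSIGHT_MATRIX[(decider, complexity)]


if __name__ == "__main__":
    # Content filtering: the machine decides and fairness is complicated.
    print(recommended_oversight(Decider.MACHINE, FairnessComplexity.COMPLEX))
```

The point of the table form is that an organization answers two questions (who decides, and how hard fairness is to define) before choosing where to place people in the process.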
Understanding the nuances of ML technology is critical, the authors conclude: “A robust research agenda regarding fairness and augmentation can help organizations more effectively leverage ML models’ benefits while limiting the potential adverse societal effects.”
Marilyn Harris is a reporter, writer, and editor with expertise in translating complex or technical material for online, print, and television audiences.