While artificial intelligence (AI) remains an emerging technology, today’s capabilities are already remarkably effective. Consider the ImageNet Large Scale Visual Recognition Challenge, which pitted competing AI algorithms against one another on object recognition and labeling at scale. The challenge ran from 2010 through 2017, demonstrating rapid improvement in computer vision, the AI discipline focused on image recognition.
The competition concluded when the error rate of the tested algorithms fell below 3%, lower than the human error rate for the same task. At that point, there was little room left for improvement: the algorithms were limited by the somewhat uneven quality of their training data. In other words, future advances in AI are less about programming finesse and more about capturing better, more consistent training data.
In more complex tasks, such as case processing, humans may bring different perspectives that lead them to disagree on how to classify more than a third of their cases. In these scenarios, the AI system cannot act as the referee. Rather, experienced professionals and deep subject matter experts must identify the underlying issues and make the final determination.
In most government offices, this process can include inspections, audits and other forms of secondary review. In this way, experienced supervisors train their employees to make better decisions. If we could capture this multi-level review process in data, we could implement more scalable AI systems ready to tackle more complex challenges.
If you learn nothing else from this blog, remember this: The most important step in adding AI to any process is correctly labeling the training data. Inter-annotator agreement (IAA), which assesses the rate of agreement across multiple human reviewers, is an effective means for determining this quality. This is often more complicated than it might seem.
Consider the popular nursery rhyme line, “the itsy-bitsy spider climbed up the water spout.” One person might identify “itsy-bitsy” as the character, while another might choose “spider” – did they agree? In a later line – “down came the rain and washed him away” – someone may neglect to label “him” despite recognizing that “him” refers to the spider. This simple example highlights the challenges in correctly labeling training data.
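To make that concrete, here is a minimal sketch of what an IAA check on the rhyme could look like. The token-level labels, the annotators’ behavior and the use of Cohen’s kappa from scikit-learn are illustrative assumptions, not a prescribed implementation:

```python
# Minimal IAA sketch (illustrative data): two hypothetical annotators label each
# token of the rhyme as CHARACTER or O (no label), and we score their agreement
# with Cohen's kappa from scikit-learn.
from sklearn.metrics import cohen_kappa_score

tokens = ("the itsy-bitsy spider climbed up the water spout "
          "down came the rain and washed him away").split()

O, CHAR = "O", "CHARACTER"

# Annotator A: "itsy-bitsy" is the character, and "him" refers to the spider.
ann_a = [CHAR if t in ("itsy-bitsy", "him") else O for t in tokens]
# Annotator B: "spider" is the character, and "him" goes unlabeled.
ann_b = [CHAR if t == "spider" else O for t in tokens]

raw = sum(a == b for a, b in zip(ann_a, ann_b)) / len(tokens)
kappa = cohen_kappa_score(ann_a, ann_b)

print(f"Raw agreement: {raw:.0%}")    # 13 of 16 tokens match
print(f"Cohen's kappa: {kappa:.2f}")  # near zero on this tiny, label-skewed sample
```

Even with 13 of 16 tokens matching, the kappa score lands near zero – a reminder that raw agreement can flatter small, skewed samples.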
Fortunately, these challenges can be measured and used to improve both the labeling of data for AI and the development of a fairer, more consistent process. Identifying labeling and classification differences – and quantifying them in an IAA score – can provide many benefits. Inconsistencies can be sorted into recognition errors (e.g., an annotator fails to label “him” at all) and classification errors (e.g., an annotator labels “him” as the rain rather than the spider).
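As a rough illustration of that sorting, the following sketch (hypothetical span-to-tag dictionaries, not a production schema) separates spans that one annotator missed entirely from spans that both annotators labeled but tagged differently:

```python
# Illustrative sketch: sort pairwise disagreements into recognition errors
# (one annotator missed the span entirely) and classification errors
# (both labeled the span but chose different tags).
def triage_disagreements(labels_a, labels_b):
    report = {"recognition": [], "classification": []}
    for span in set(labels_a) | set(labels_b):
        tag_a, tag_b = labels_a.get(span), labels_b.get(span)
        if tag_a == tag_b:
            continue  # the annotators agree on this span
        if tag_a is None or tag_b is None:
            report["recognition"].append(span)     # someone missed the span
        else:
            report["classification"].append(span)  # tagged, but differently
    return report

reviewer_1 = {"him": "SPIDER", "rain": "RAIN"}
reviewer_2 = {"rain": "RAIN"}                  # never labeled "him"
reviewer_3 = {"him": "RAIN", "rain": "RAIN"}   # labeled "him" as the rain

print(triage_disagreements(reviewer_1, reviewer_2))  # {'recognition': ['him'], 'classification': []}
print(triage_disagreements(reviewer_1, reviewer_3))  # {'recognition': [], 'classification': ['him']}
```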
By having multiple people label data and measuring their level of agreement, supervisors can distinguish labelers who may need additional training from those who consistently “get it right.” In other words, implementing a system to monitor IAA is not just a prerequisite for developing an AI system, but also a powerful way to ensure more consistent and fair decision-making within government.
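One lightweight way to support that kind of monitoring, sketched below with made-up labeler names, decisions and an arbitrary 80% threshold, is to score each labeler against a simple majority-vote consensus:

```python
# Illustrative sketch: flag labelers whose agreement with the majority-vote
# consensus falls below a chosen threshold, so supervisors know where to
# focus additional training. All names, decisions and thresholds are made up.
from collections import Counter

labels = {
    "labeler_1": ["APPROVE", "DENY", "APPROVE", "APPROVE", "DENY"],
    "labeler_2": ["APPROVE", "DENY", "APPROVE", "APPROVE", "DENY"],
    "labeler_3": ["DENY",    "DENY", "APPROVE", "DENY",    "APPROVE"],
}

n_items = len(next(iter(labels.values())))

# Majority vote across labelers for each item.
consensus = [
    Counter(decisions[i] for decisions in labels.values()).most_common(1)[0][0]
    for i in range(n_items)
]

for name, decisions in labels.items():
    agreement = sum(d == c for d, c in zip(decisions, consensus)) / n_items
    flag = "  <- may need additional training" if agreement < 0.80 else ""
    print(f"{name}: {agreement:.0%} agreement with consensus{flag}")
```

Of course, the consensus itself should not be taken at face value; it needs expert review as well.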
Clearly, human experts are needed to make determinations about how data should be annotated and what is deemed “correct.” While technology is needed to capture the data and measure performance, this isn’t simply a technology challenge. Rather, subject matter experts and technologists must work together to build and continuously improve a trustworthy AI solution.
Dominic Delmolino is a GovLoop Featured Contributor. He is the Chief Technology Officer at Accenture Federal Services and leads the development of the firm’s technology strategy. He has been instrumental in establishing Accenture’s federal activities in the open source space and has played a key role in the business by fostering and facilitating federal communities of practice for cloud, DevOps, artificial intelligence and blockchain.
This post was co-authored by Ian McCulloh, PhD, Accenture Federal Services Chief Data Scientist.