A UCD student's research has resulted in the withdrawal of an 80-million-image library used to train artificial intelligence systems.
The research, by PhD student Abeba Birhane, found that academic datasets containing hundreds of millions of images, used to develop AI systems and applications, are partly labelled with racist and misogynistic terms and slurs. That's according to Lero, the Irish Software Research Centre, and University College Dublin's Complex Software Lab.
"Already, MIT has deleted its much-cited '80 Million Tiny Images' dataset, asking researchers and developers to cease using the library to train AI and ML systems," said the software research centre in a statement. "MIT's decision came as a direct result of the research carried out by University College Dublin-based Lero researcher Abeba Birhane and Vinay Prabhu, chief scientist at UnifyID, a privacy startup in Silicon Valley."
In the course of the work, the Lero statement says, Ms Birhane found the MIT database contained thousands of images labelled with racist and misogynistic insults and derogatory terms.
This "contaminates" the AI databases, Ms Birhane said.
"Face recognition systems built on such datasets embed harmful stereotypes and prejudices," she said.
"Not only is it unacceptable to label people's images with offensive terms without their awareness and consent, training and validating AI systems with such datasets raises grave problems in the age of ubiquitous AI.
"When such systems are deployed into the real world, in security, hiring or policing systems, the consequences are dire, resulting in individuals being denied opportunities or labelled as criminals.
"More fundamentally, the practice of labelling a person based on their appearance risks reviving the long discredited pseudo-scientific practice of physiognomy."
There are many datasets around the world which might be affected, she said.
"Lack of scrutiny has played a role in the creation of monstrous and secretive datasets without much resistance, prompting further questions such as what other secretive datasets currently exist hidden and guarded under the guise of proprietary assets," she said.
The researchers also found that all of the images used to populate the datasets examined were "non-consensual" images, including those of children, scraped from seven image search engines, including Google.
"From the questionable ways images were sourced, to troublesome labelling of people in images, to the downstream effects of training AI models using such images, large-scale vision datasets may do more harm than good," said Ms Birhane.
"I would urge the machine learning community to pay close attention to the direct and indirect impact of our work on society, especially on vulnerable groups," she added.