The work of an Irish-based researcher has prompted a leading global university to delete an 80-million-image library containing thousands of images labelled with racist and misogynistic insults and other derogatory terms.
Massachusetts Institute of Technology (MIT) has withdrawn its much-cited 80 Million Tiny Images database, which has been used to train Artificial Intelligence (AI) and Machine Learning (ML) systems.
MIT has asked researchers and developers to cease using the library to train AI and ML systems.
The dataset was found to contain offensive labels and slur terms applied to images of people.
MIT’s decision came as a direct result of work involving Abeba Birhane, a University College Dublin-based researcher for Lero, the Science Foundation Ireland Research Centre for Software.
Ms Birhane’s work was a collaboration with Vinay Prabhu, chief scientist at UnifyID, a privacy start-up in Silicon Valley.
She said that linking images to slurs and offensive language infuses prejudice and bias into AI and ML models, perpetuating harmful stereotypes.
“Not only is it unacceptable to label people’s images with offensive terms without their awareness and consent, training and validating AI systems with such a dataset raises grave problems in the age of ubiquitous AI,” she said.
Ms Birhane, a PhD student, said that when such systems are deployed in the real world – in security, hiring, or policing systems – the consequences are “dire, resulting in individuals being denied opportunities or labelled as a criminal.
“More fundamentally, the practice of labelling a person based on their appearance risks reviving the long discredited pseudo-scientific practice of physiognomy.”
The 80 Million Tiny Images dataset is one of the Large Scale Vision Datasets (LSVD) and there are many others in use around the world.
Ms Birhane said lack of scrutiny had “played a role in the creation of monstrous and secretive datasets without much resistance, prompting further questions such as: what other secretive datasets currently exist hidden and guarded under the guise of proprietary assets?”
The researchers also found that all of the images used to populate the datasets examined were “non-consensual” images, including those of children, scraped from seven image search engines, including Google.
They argued that, in the age of Big Data, the fundamentals of informed consent, privacy, or agency of the individual “have gradually been eroded.
“Institutions, academia, and industry alike, amass millions of images of people without consent and often for unstated purposes under the guise of anonymisation, a claim that is both ephemeral and vacuous.”
They said their goal was to raise awareness in the AI and ML community of the severity of the threats posed by ill-considered datasets, and of the direct and indirect impact of such work on society, especially on vulnerable groups.