AI Dataset for Detecting Nudity Contained Child Sexual Abuse Images


A large image dataset used to develop AI tools for detecting nudity contains a number of images of child sexual abuse material (CSAM), according to the Canadian Centre for Child Protection (C3P). 

The NudeNet dataset, which contains more than 700,000 images scraped from the internet, was used to train an AI image classifier that could automatically detect nudity in an image. C3P found that more than 250 academic works have either cited or used the NudeNet dataset since it was made available for download on Academic Torrents, a platform for sharing research data, in June 2019.

“A non-exhaustive review of 50 of these academic projects found 13 made use of the NudeNet data set, and 29 relied on the NudeNet classifier or model,” C3P said in its announcement.

C3P found more than 120 images of identified or known victims of CSAM in the dataset, including nearly 70 images focused on the genital or anal area of children who are confirmed or appear to be pre-pubescent. In some cases, C3P said, the images depicted sexual or abusive acts involving children and teenagers, such as fellatio or penile-vaginal penetration.

People and organizations that downloaded the dataset would have had no way of knowing it contained CSAM unless they went looking for it, and most likely they did not, but having those images on their machines would technically be a crime.

“CSAM is illegal and hosting and distributing creates huge liabilities for the creators and researchers. There is also a larger ethical issue here in that the victims in these images have almost certainly not consented to have these images distributed and used in training,” Hany Farid, a professor at UC Berkeley and one of the world’s leading experts on digitally manipulated images, told me in an email. Farid also developed PhotoDNA, a widely used image-identification and content filtering tool. “Even if the ends are noble, they don’t justify the means in this case.”

“Many of the AI models used to support features in applications and research initiatives have been trained on data that has been collected indiscriminately or in ethically questionable ways. This lack of due diligence has led to the appearance of known child sexual abuse and exploitation material in these types of datasets, something that is largely preventable,” Lloyd Richardson, C3P’s director of technology, said.

Academic Torrents removed the dataset after C3P issued a removal notice to its administrators. 

C3P’s findings are similar to 2023 research from Stanford University’s Cyber Policy Center, which found that LAION-5B, one of the largest datasets powering AI-generated images, also contained CSAM. The organization that manages LAION-5B removed it from the internet following that report and only shared it again once it had removed the offending images. 

“As countries continue to invest in the development of AI technology, it’s crucial that researchers and industry consider the ethics of their work every step of the way,” Richardson said.
