Photos of Australian children have been included in the dataset used by several AI image-generating tools without their or their families’ knowledge or consent, research by Human Rights Watch (HRW) has found.
An analysis of less than 0.0001% of the 5.85bn images in the Laion-5B dataset, which is used by services including Stable Diffusion creator Stability AI and Midjourney, found 190 photos of Australian children scraped from the internet.
Laion-5B was built by scraping photos off the internet. Germany-based Laion does not keep a repository of the images themselves; instead, the dataset contains a list of URLs pointing to the original images, along with the alternate text attached to them.
HRW found that children whose images were in the dataset were easily identifiable, with some names included in the accompanying caption or in the URL where the image was stored. In some cases, the dataset also included information on when and where a photo was taken.
One photo HRW found featured two boys in front of a colourful mural; the accompanying text revealed their names, ages and the preschool they attended, information not available anywhere else on the internet, the organisation said.
Hye Jung Han, HRW’s children’s rights and technology researcher, told Guardian Australia the photos were being lifted from photo and video sharing sites, as well as school websites.
“These are not easily findable on school websites,” she said. “They might have been taking images of a school event or like a dance performance or swim meet and wanted a way to share these images with parents and kids.
“It’s not quite a password-protected part of their website, but it’s a part of the website that is not publicly accessible, unless you were sent the link.
“These were not webpages that were indexed by Google.”
HRW also found an unlisted YouTube video of schoolies celebrations in the dataset. Such videos are not searchable on YouTube, and scraping YouTube is against its policies, Han said.
Images of Indigenous children were also found, some of them more than a decade old. Han said this raised questions about how images of recently deceased Indigenous people could be protected if they were included in a dataset being used to train AI.
Laion, the organisation behind the open-source dataset, was approached for comment. It has a form where users can submit feedback on issues in the dataset. According to HRW, Laion confirmed last month that the personal photos were included and pledged to remove them, but said children and their guardians were ultimately responsible for removing personal photos from the internet.
Han said the practice risks harming two groups of children: those whose photos are scraped, and those against whom malicious AI tools built on the dataset, such as deepfake apps, could be used.
“Almost all of these free nudify apps have been built on Laion-5B because it is the biggest image and text training dataset out there,” she said.
“It’s being used by untold numbers of AI developers, and some of those apps were specifically being used to cause harm to children.”
Last month, a teenage boy was arrested and later released after nude images, created by AI using the likenesses of about 50 female students from Bacchus Marsh Grammar, were circulated online.
The federal government in June introduced legislation to ban the creation and sharing of deepfake pornography, but HRW argued this failed to address the deeper problem: children’s personal data is unprotected from misuse, including the use of real children’s likenesses in deepfakes.
“No one knows how AI is going to evolve tomorrow. I think the root of the harm lies in the fact that children’s personal data are not legally protected, and so they’re not protected from misuse by any actor or any type of technology,” Han said.
HRW said this should be addressed in the legislation to update the Privacy Act, expected in August, which it argued should prohibit the scraping of children’s data into AI systems and the nonconsensual digital replication or manipulation of children’s likenesses.
In 2021 the Australian privacy commissioner found that Clearview AI’s scraping of images from social media for use in facial recognition technology “may adversely impact the personal freedoms of all Australians” and that the company had breached Australians’ privacy.
Han said it was a strong statement, but one that now needed to be backed up by law and the enforcement of that law.
“There’s still a long way to go.”