“Our brains have 100 trillion connections. Large language models have up to half a trillion, a trillion at most. Yet GPT-4 knows hundreds of times more than any one person does.”
Geoffrey Hinton, deep learning pioneer
The recent surge in interest in generative AI was sparked by neural networks trained on high-quality, public human culture.
Their use of this culture is extremely focussed – they only saw good quality inputs, and only saw each input once (see the paper One Epoch Is All You Need for why). If you show them lots of bad quality stuff, they’re not adaptive enough to notice and ignore it.
So what exactly makes the data they’re trained on “high quality”? I decided to dig into two examples, GPT-3 for text and Stable Diffusion for images, to find out. (I’m not an expert at training networks, so please leave a comment about anything you see that I get wrong.)
GPT-3 — good articles that Reddit linked to
GPT-3 feels a bit like it saw all of the internet as input. Actually, its makers started by training a separate curator network to pick out the “high quality” parts of the internet (see 2.2 Training Dataset and Appendix A in the GPT-3 paper).
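Here is a rough sketch of how a curator's score could turn into a keep-or-drop decision. As far as I can tell from Appendix A of the GPT-3 paper, a page was kept when a Pareto-distributed random draw (shape 9) came out larger than one minus the curator's quality score, so high scorers mostly get through while a few low scorers still sneak in. The function names are my own and the scores are made up; treat it as an illustration rather than the actual pipeline.

```python
import random

ALPHA = 9  # Pareto shape; the value I recall from Appendix A of the GPT-3 paper


def keep_document(quality_score, rng):
    """Decide whether to keep a page, given the curator's score in [0, 1]."""
    # random.paretovariate returns values >= 1, so subtract 1 to get a draw
    # that starts at 0 (the convention numpy.random.pareto uses).
    draw = rng.paretovariate(ALPHA) - 1.0
    return draw > 1.0 - quality_score


def keep_rate(quality_score, trials=10_000, seed=0):
    """Estimate the fraction of pages with this score that would be kept."""
    rng = random.Random(seed)
    kept = sum(keep_document(quality_score, rng) for _ in range(trials))
    return kept / trials


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.95):
        print(f"score {score:.2f}: kept about {keep_rate(score):.1%}")
```

The randomness is the interesting design choice: a hard cut-off would only ever keep pages that look like the curator's training examples, while this rule lets a small trickle of unusual pages through.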
What is high quality? Well, the curator network for GPT-3 was taught its concept of high quality by looking at an internet corpus called WebText. That was made by grabbing a copy of all of Reddit, and looking at the outbound links in moderately upvoted posts/comments.
(Details: Read 2.2 Training Dataset in the GPT-2 paper — that says “at least 3 karma”, which doesn’t make clear sense to me. As far as I can tell it is Reddit users who have karma, not links or posts. OpenWebText2, a recreation of this dataset, used URLs from submissions which have a score of 3 or higher — see their documentation — which seems a good assumption for what GPT-3 did. Possibly they took links from posts by users with a karma greater than 3. Posts are also separate from comments, and users have a separate karma score for each.)
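To make the “score of 3 or higher” reading concrete, here is a minimal sketch of that filter. The `Submission` record and its field names are hypothetical stand-ins for whatever a Reddit dump actually provides, the example posts are invented, and skipping text-only posts and duplicate URLs is my own tidying-up rather than something the papers spell out.

```python
from dataclasses import dataclass

MIN_SCORE = 3  # "a score of 3 or higher", per the OpenWebText2 documentation


@dataclass
class Submission:
    # Hypothetical record; a real Reddit dump has many more fields.
    title: str
    url: str
    score: int
    is_self: bool  # True for text-only posts with no outbound link


def outbound_links(submissions, min_score=MIN_SCORE):
    """Return unique outbound URLs from link posts scoring at least min_score."""
    seen = set()
    links = []
    for post in submissions:
        if post.is_self or post.score < min_score:
            continue
        if post.url not in seen:
            seen.add(post.url)
            links.append(post.url)
    return links


# Invented example posts, purely for illustration.
EXAMPLE_POSTS = [
    Submission("A well-liked article", "https://example.org/good", 12, False),
    Submission("Barely noticed", "https://example.org/ignored", 1, False),
    Submission("Text-only discussion", "https://example.org/self", 40, True),
    Submission("Repost of the article", "https://example.org/good", 5, False),
]

if __name__ == "__main__":
    print(outbound_links(EXAMPLE_POSTS))  # ['https://example.org/good']
```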
GPT-3 was also given lots of books and Wikipedia — but most (82%) of what it took in was the good pages that Reddit linked to, and other pages that “feel” similarly high quality.
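For where the 82% comes from: the weights below are the training-mix percentages as I read them from the dataset table in the GPT-3 paper, and the 82% is filtered Common Crawl plus WebText2. The paper's figures are rounded, which is why they add up to 101.

```python
# Training-mix weights in percent, as listed in the GPT-3 paper's dataset table.
MIX = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}


def web_share(mix):
    """Share of training drawn from Reddit-linked pages and pages that resemble them."""
    return mix["Common Crawl (filtered)"] + mix["WebText2"]


if __name__ == "__main__":
    print(web_share(MIX))      # 82
    print(sum(MIX.values()))   # 101, because the published weights are rounded
```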
Stable Diffusion — pretty competition-winning images
Once again, this begins with a copy of the whole Internet, this time pulling out all the images that have alt-text attributes in the HTML that uses them. This is already a filter: well-made sites which care about accessibility will have more and better alt-text. A project called LAION then classifies those images using AI so they can be filtered by language, resolution, chance of having a watermark, and aesthetic score.
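The alt-text step is easy to picture in code. This is only a sketch using Python's standard `html.parser`, not LAION's actual crawler; the class and function names are mine and the sample HTML is invented.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags with non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # A bare `alt` attribute has the value None, so fall back to "".
        alt = (attributes.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def image_alt_pairs(html):
    """Return every (src, alt) pair found in one HTML document."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


# Invented sample page: only the first image has usable alt text.
SAMPLE_HTML = """
<p><img src="banana.jpg" alt="A single banana on a white plate">
<img src="spacer.gif" alt="">
<img src="logo.png"></p>
"""

if __name__ == "__main__":
    print(image_alt_pairs(SAMPLE_HTML))
    # [('banana.jpg', 'A single banana on a white plate')]
```

Images with missing or empty alt text simply never make it into the dataset, which is the accessibility filter at work.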
Stable Diffusion was trained in a complicated series of checkpoints, starting off with a bit of general LAION data but ending up mostly trained on highly aesthetic images. This is much the same as for GPT-3: a curator AI (the LAION-Aesthetic Predictor V2) learns to judge high quality images. Those images from LAION that the predictor scores 5 or higher were used to train Stable Diffusion.
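As a sketch of what that selection might look like once every image has its scores: the only threshold taken from the text above is the aesthetic score of 5. The `ImageRecord` fields, the other default thresholds and the sample rows are all placeholders I made up for illustration.

```python
from dataclasses import dataclass

MIN_AESTHETIC = 5.0  # "5 or higher", the one cut-off taken from the text


@dataclass
class ImageRecord:
    # Hypothetical fields standing in for LAION's per-image metadata.
    url: str
    language: str
    width: int
    height: int
    watermark_probability: float
    aesthetic_score: float


def select_training_images(records, min_aesthetic=MIN_AESTHETIC,
                           min_side=512, max_watermark=0.5, language="en"):
    """Keep records that pass every filter; non-aesthetic defaults are placeholders."""
    return [
        r for r in records
        if r.aesthetic_score >= min_aesthetic
        and min(r.width, r.height) >= min_side
        and r.watermark_probability < max_watermark
        and r.language == language
    ]


# Invented rows, purely for illustration.
SAMPLE_RECORDS = [
    ImageRecord("https://example.org/sunset.jpg", "en", 1024, 768, 0.05, 6.2),
    ImageRecord("https://example.org/stock.jpg", "en", 1024, 768, 0.90, 6.0),
    ImageRecord("https://example.org/thumb.jpg", "en", 120, 90, 0.05, 5.5),
    ImageRecord("https://example.org/dull.jpg", "en", 800, 800, 0.05, 4.1),
]

if __name__ == "__main__":
    for record in select_training_images(SAMPLE_RECORDS):
        print(record.url)  # only https://example.org/sunset.jpg passes
```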
But what does an aesthetic score of 5 mean? For a quick feel, this page shows increasingly aesthetic buckets of images from the full LAION dataset as you go down the page. Digging into the data used to train the aesthetic predictor, there are two main sources:
1. Simulacra Aesthetic Captions – manually rated images made from an earlier (non-aesthetically trained) checkpoint of Stable Diffusion. Humans, I assume from the Stable Diffusion community, rated a couple of hundred thousand images by hand.
2. Aesthetic Visual Analysis – this is a download of all the images from DPChallenge, a 20-year-old digital photography competition site. A few times a week for decades it has run competitions like “photograph bananas, any composition, must have at least one banana”. While each competition only gets tens of entries these days, they used to get hundreds.
There’s a bit more complexity: a small number of specially aesthetically rated logos were thrown in, I think to improve typography and font quality. You can browse all the images used to train Stable Diffusion (the browser is by Andy and Simon; those links go to their blog posts about it).
The notable common properties are:
- Each piece of training data is only shown once to the AI during training
- Both have a core dataset with some human-curated metric for “high quality”
- Both extended that core dataset by training a curator AI to pick out similar high quality items
It’s quite an odd situation, especially given the emergent properties of reasoning, coding and so on that these kinds of models have. The training mechanism isn’t particularly smart; the smart stuff emerges inside the neural networks so trained.
The AIs have learnt to think extremely well about a large chunk of human knowledge in symbolic form. Right now, they are heavily reliant on humans: on every upvote you made on Reddit, and on an old-time niche digital photography competition site.
One thought on “What is high-quality about the data that trained generative AI?”
I downloaded WebText2 (27 GB compressed) and, as a first step, searched it for references to this blog (flourish.org).
The main thing I learnt is that everything seems to come from a post on Reddit – not a comment. Also, Reddit has (or some subreddits have) a lot more posts that are just links than I thought.
These were the Reddit posts WebText2 contained that link to different articles on flourish.org.
* “TIL about Upside Down maps” https://www.reddit.com/r/todayilearned/comments/92pyd/til_about_upside_down_maps/
* “Astonishments, ten, in the history of version control” https://www.reddit.com/r/programming/comments/2tidaz/astonishments_ten_in_the_history_of_version/
* “Francis Irving: Why I’m collecting every MP candidate’s CV” https://www.reddit.com/r/unitedkingdom/comments/2zw9br/francis_irving_why_im_collecting_every_mp/
* “On finding political axes using maths” https://www.reddit.com/r/ukpolitics/comments/4utkqr/on_finding_political_axes_using_maths/
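For anyone wanting to try the same search on their own site, here is roughly the kind of check involved. It is a sketch that assumes the dataset has already been loaded into a list of dicts with a `url` key; that shape, the `links_to_domain` name and the sample rows are my own invention, not the dataset's real format.

```python
from urllib.parse import urlparse


def links_to_domain(records, domain):
    """Return the records whose URL host is `domain` or one of its subdomains."""
    matches = []
    for record in records:
        host = (urlparse(record["url"]).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            matches.append(record)
    return matches


# Placeholder rows; the paths are invented, not real dataset entries.
SAMPLE_ROWS = [
    {"url": "https://www.flourish.org/some-article"},
    {"url": "http://flourish.org/another-article"},
    {"url": "https://notflourish.org/lookalike"},
    {"url": "https://example.org/?ref=flourish.org"},
]

if __name__ == "__main__":
    for row in links_to_domain(SAMPLE_ROWS, "flourish.org"):
        print(row["url"])
```

Matching on the parsed hostname rather than doing a plain substring search avoids false hits from lookalike domains or from the domain merely appearing in a query string.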
It’s kinda odd seeing parts of my own mind becoming part of the AI like this!