The trouble is, assembling a data set like ImageNet by hand takes a lot of time and effort. The images are typically labeled by low-paid crowdworkers. Data sets might also contain sexist or racist labels that can bias a model in hidden ways, as well as images of people who have been included without their consent. There’s evidence these biases can creep in even in pretraining.

How bits work

You’ve probably heard before that computers store things in 1s and 0s. These fundamental units of information are known as bits. When a bit is “on,” it corresponds to a 1; when it’s “off,” it corresponds to a 0. Each bit, in other words, can hold only one of two values.
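
For readers who like to see this in code, here is a minimal Python sketch that lists every pattern a given number of bits can take. The function name bit_patterns is ours, chosen purely for illustration; it is not from any particular library.

```python
from itertools import product


def bit_patterns(n):
    """Return every pattern that n bits can take, as strings of 0s and 1s."""
    # product("01", repeat=n) walks through all ways of choosing
    # a 0 or a 1 for each of the n positions.
    return ["".join(bits) for bits in product("01", repeat=n)]


# A single bit has just two possible states.
print(bit_patterns(1))  # ['0', '1']
```

With n set to 1, the function returns only "0" and "1", the two states a single bit can be in. Passing a larger n lists the patterns for longer strings of bits.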

But once you string them together, the amount of information you can encode grows exponentially. Two bits can represent four pieces of information because there are 2^2 combinations: 00, 01, 10, and 11. Four bits can represent

