On GPT-3

GPT-3 is scary because it’s a tiny model compared to what’s possible, with a simple uniform architecture trained in the dumbest way possible (predicting the next text token) on a single impoverished modality (random Internet text dumps) and on tiny data (it fits on a laptop), and yet the first version already manifests crazy runtime meta-learning—and the scaling curves still are not bending! The samples are also better than ever, whether it’s GPT-3 inventing new dick jokes or writing (mostly working) JavaScript tutorials about rotating arrays. Does it set SOTA on every task? No, of course not. But the question is not whether we can find some lawyerly way in which it might not work, but whether there is any way in which it might work.
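
To make concrete what “the dumbest way possible” means, here is a minimal sketch of the next-token-prediction objective in PyTorch. `TinyLM` is a hypothetical stand-in, not GPT-3’s actual architecture (a real LM attends causally over the prefix); the point is the shifted-target cross-entropy loss, which is the entire training signal.

```python
import torch
import torch.nn.functional as F

# TinyLM is a hypothetical stand-in: it maps each token id to logits
# independently, whereas a real LM conditions on all earlier tokens.
class TinyLM(torch.nn.Module):
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.proj = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq)
        return self.proj(self.embed(tokens))   # logits: (batch, seq, vocab)

model = TinyLM()
tokens = torch.randint(0, 256, (4, 32))        # a batch of "Internet text"
logits = model(tokens[:, :-1])                 # predictions at positions 0..seq-2
loss = F.cross_entropy(                        # score each against the *next* token
    logits.reshape(-1, logits.size(-1)),       # (batch*(seq-1), vocab)
    tokens[:, 1:].reshape(-1),                 # targets shifted one step left
)
loss.backward()                                # that is the whole training signal
```

Everything else (the parameter count, the data, the uniform architecture) is detail layered on top of this one loss, which is what makes the resulting meta-learning so striking.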
