We’ve polluted the water. One active area that LLMs are being deployed is in reading scanned text, so my best guess is that the next few models are going to be trained on a new corpus of previously unscanned written text.
I’m talking legal documents from the 80s, company documents that were never digitized, and anythibg else google books hasnt fully OCR’d.
We’ve polluted the water. One active area that LLMs are being deployed is in reading scanned text, so my best guess is that the next few models are going to be trained on a new corpus of previously unscanned written text.
I’m talking legal documents from the 80s, company documents that were never digitized, and anythibg else google books hasnt fully OCR’d.