My intuition:
- There are “genuine” instances of hapax legomena that probably carry some semantic sense, e.g. a rare concept, a wordplay, an artistic invention, an ancient inside joke.
- There are various kinds of noise: because somebody let their cat onto the keyboard, because OCR software failed in one small spot, because somebody copied data over a noisy channel without error correction, because somebody had a headache and couldn’t be bothered, because whatever.
- Once a dataset is too big to be manually reviewed by experts, the amount of general noise is far, far larger than what you’re looking for, and you can’t differentiate between the two using statistics alone (see the sketch after this list). And if the dataset has been manually reviewed, the experts have probably published their findings, or at least told a few colleagues.
- Transformers are VERY data-hungry. They need enormous datasets.
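To make the “statistics alone” point concrete, here is a minimal sketch (plain Python, with a toy corpus I made up) of the usual way hapax legomena are extracted: anything with frequency 1 lands in the same bucket, whether it’s a genuine rare word or a typo, so the count by itself tells you nothing about which is which.

```python
from collections import Counter
import re

def hapax_legomena(text: str) -> list[str]:
    # Naive tokenization: lowercase alphabetic word forms only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A hapax legomenon is any token that occurs exactly once.
    return [word for word, n in counts.items() if n == 1]

if __name__ == "__main__":
    # Toy corpus: "teh" is keyboard noise, "serendipity" is a genuine rare word.
    corpus = "the cat sat and the cat sat and teh serendipity"
    print(hapax_legomena(corpus))  # ['teh', 'serendipity'] -- noise and signal in one bucket
```

The frequency statistic puts both items side by side; separating them requires knowledge about the language or the source, which is exactly what a large unreviewed dataset doesn’t give you.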
So I don’t think this approach will help you much even for finding individual words and phrases. And everything I’ve said extends to semantic-level noise too, so your extended question also looks like a hopeless endeavour when approached specifically with LLMs or big-data analysis of text.
Maybe some Borges too?