Yes, and they’ll use legislation to pull up the ladder behind them. It’s a form of regulatory capture, and it will absolutely lock out small players.
There are open source AI training datasets, but the question is whether LLMs trained on them can be as accurate.
Thanks for the link to Common Crawl; I didn’t know about that project but it looks interesting.
That’s also an interesting point about heavily curated datasets. Would something like that be able to overcome some of the bias in current models? For example, if you were training a facial recognition model, you could use a curated, open source dataset with representative samples of all races and genders to try to reduce racial bias. Anyone training a facial recognition model for any purpose would then have a training set that can be peer reviewed for accuracy.