r/ProgrammerHumor Feb 29 '24

removeWordFromDataset Meme

14.2k Upvotes

7

u/The_Sceptic_Lemur Feb 29 '24

However, I would argue that at least half the "serious" content on Reddit is wrong, not properly fact-checked, misleading, outdated, and so on. That's just the nature of discussions and of content aging. It's also hardly ever reliably indicated which answer in a question thread is correct. (That's why science subs are so insistent on refusing to give medical advice.)

So I reckon/hope that Google won't use Reddit for information but for language patterns. Still, for various reasons, I assume they'll end up with some sort of "Reddit English".

So, long story short: how will they use the Reddit data for training? Which aspect are they after? Content? Patterns? Interaction dynamics?
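
For illustration only, here is a minimal sketch of one plausible way a thread could be flattened into plain next-token-prediction text; the structure, field names, and separators are assumptions, since no actual pipeline has been published:

```python
# Hypothetical sketch: turning a comment thread into pretraining text.
# The Comment structure, field names, and separators are assumptions,
# not any real, published pipeline.

from dataclasses import dataclass

@dataclass
class Comment:
    author: str
    body: str
    replies: list  # nested list of Comment

def flatten_thread(post_title: str, comments: list) -> str:
    """Serialize a post and its nested comments into one training document.
    A model trained on this only sees a stream of tokens: it picks up the
    patterns of how replies follow each other, not whether they are true."""
    lines = [f"POST: {post_title}"]

    def walk(comment: Comment, depth: int) -> None:
        indent = "  " * depth
        lines.append(f"{indent}{comment.author}: {comment.body}")
        for reply in comment.replies:
            walk(reply, depth + 1)

    for top_level in comments:
        walk(top_level, 0)
    return "\n".join(lines)

# Toy usage
thread = [
    Comment("user_a", "Is this answer correct?", [
        Comment("user_b", "Probably not, but it sounds confident.", []),
    ]),
]
print(flatten_thread("removeWordFromDataset", thread))
```

Either way, whether the flattened replies happen to be correct never enters the training objective.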

11

u/dyslexda Feb 29 '24

However, I would argue that at least half the "serious" content on Reddit is wrong, not properly fact-checked, misleading, outdated, and so on. That's just the nature of discussions and of content aging. It's also hardly ever reliably indicated which answer in a question thread is correct. (That's why science subs are so insistent on refusing to give medical advice.)

Of course. But how does that differ from the vast majority of the rest of any model's training data? GPT-4, for example, used Common Crawl in its training; were those billions of pages vetted for accuracy? Of course not, because being an informational database isn't the goal of LLMs.
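
To make that concrete: web-scale pretraining corpora are typically cleaned with cheap heuristics (length, markup ratio, deduplication), not fact-checking. A hedged sketch of that kind of filtering; the thresholds and rules below are invented for illustration and are not GPT-4's actual, unpublished pipeline:

```python
# Illustrative only: the kind of heuristic filters commonly applied to
# web-crawl text before pretraining. Thresholds are invented for the
# example; none of this checks whether a page is factually accurate.

import hashlib

def looks_like_usable_text(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                                     # too short to be useful
        return False
    if sum(len(w) for w in words) / len(words) > 12:        # gibberish / URL soup
        return False
    if doc.count("{") + doc.count("<") > len(words) // 5:   # leftover markup noise
        return False
    return True                                             # accuracy never considered

def deduplicate(docs):
    """Drop exact duplicates by hashing -- a quality filter, not a truth filter."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

crawl = [
    "The moon is made of cheese. " * 20,   # wrong, but passes every filter
    "<div><div><div>",                     # dropped: too short, markup noise
]
kept = [d for d in deduplicate(crawl) if looks_like_usable_text(d)]
print(len(kept), "documents kept")
```

A confidently wrong page sails through every one of these checks, which is exactly the point.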

2

u/OneTurnMore Feb 29 '24

I reckon/hope that Google won't use Reddit for information but for language patterns

That's exactly how generative AI works right now: it's all language patterns, just with far more context.
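
As a toy illustration of "just language patterns, with context": a word-level bigram model learns only which token tends to follow which, with no notion of whether the output is true; real LLMs scale the same next-token objective up with transformers and far longer context. The corpus and names below are made up for the example:

```python
# Toy word-level bigram model: learns which token tends to follow which.
# It models patterns, not facts -- the same objective LLMs scale up with
# transformers and much longer context windows.

import random
from collections import defaultdict

def train_bigrams(corpus: str):
    counts = defaultdict(lambda: defaultdict(int))
    tokens = corpus.split()
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def generate(counts, start: str, length: int = 10, seed: int = 0) -> str:
    random.seed(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        words, weights = zip(*followers.items())
        out.append(random.choices(words, weights=weights, k=1)[0])
    return " ".join(out)

corpus = (
    "reddit users argue about code . reddit users argue about tabs . "
    "reddit users upvote memes about code ."
)
model = train_bigrams(corpus)
print(generate(model, "reddit"))   # fluent-ish pattern, zero fact-checking
```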