Hi Remmelt, thanks for sharing these thoughts! I generally agree that mitigating and avoiding harms from AI should involve broad democratic participation rather than a narrow technical focus; it reminded me a lot of Gideon’s post “We are fighting a shared battle”. So view the questions below more as nitpicks, since I mostly agreed with the ‘vibe’ of your post.
AI companies have scraped so much personal data, that they are breaking laws.
Quick question for my understanding: do the major labs actually do their own scraping, or do other companies do the scraping which the major AI labs pay for? I’m thinking of the use of Common Crawl to train LLMs here, for instance. This could potentially affect the legal angle, though that’s not my area of expertise.
Again, just for my own clarification: what do you think about this article? I have my own thoughts but would like to hear your take.
AI Ethics researchers have been supporting creatives, but lack funds. AI Safety has watched on, but could step in to alleviate the bottleneck. Empowering creatives is a first step to de-escalating the conflict.
Thanks for the article; after a quick skim I’m definitely going to sit down and read it properly. My honest question is: do you think this is actually a step to de-escalate the Ethics/Safety feud? Sadly, my own thoughts have become a lot more pessimistic over the last ~year or so, and I think asking the ‘Safety’ side to make a unilateral de-escalatory step is unlikely to actually lead to much progress.
(if the above questions aren’t pertinent to your post here, happy to pick them up in DMs or some other medium)
Hey, thank you too for the nitty-gritty thoughts!

do the major labs actually do their own scraping, or do other companies do the scraping which the major AI labs pay for?
The major labs I know of that have released the most “generally functional” models (OpenAI, DeepMind, Anthropic) mostly seem to do their own scraping at this point.
In the early days, OpenAI made use of the BookCorpus and Common Crawl datasets, but if those are still included, they would be a small portion of the total training data. Maybe OpenAI used an earlier version of the Books3 dataset for training GPT-3?
Of course, they still delegate some of the work by finding websites to scrape from (pirated-book websites are a thing). But I think it was some years ago that they relied most heavily on academic and open-source datasets.
And then there are “less major” AI labs that have relied on “open-source” datasets: Meta used Books3 to train the previous LLaMA model (for the current model, Meta did not disclose its datasets), and Stability AI used LAION (while funding and offering compute to the LAION group, which under German law means that the LAION dataset can no longer be treated as “research-only”).
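To illustrate what relying on a third-party crawl looks like in practice, here is a minimal sketch of pulling a page capture out of Common Crawl’s public index instead of scraping the site yourself. This is just an illustration, not any particular lab’s pipeline; the crawl snapshot ID and the example.com lookup are placeholder assumptions, and it assumes the Python requests library.

```python
# Minimal sketch (not any lab's actual pipeline): fetching one page capture
# from Common Crawl rather than crawling the site directly.
# Assumes the public index at index.commoncrawl.org; the crawl ID below is
# only an example and may need updating to a current snapshot.
import gzip
import json
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl snapshot

# 1. Ask the index which WARC file holds captures of a given URL.
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "example.com", "output": "json"},
    timeout=30,
)
record = json.loads(resp.text.splitlines()[0])  # take the first capture

# 2. Fetch just that record from the WARC file with an HTTP range request.
start = int(record["offset"])
end = start + int(record["length"]) - 1
warc = requests.get(
    "https://data.commoncrawl.org/" + record["filename"],
    headers={"Range": f"bytes={start}-{end}"},
    timeout=30,
)

# 3. Each record is its own gzip member: decompress to get headers + HTML.
print(gzip.decompress(warc.content).decode("utf-8", errors="replace")[:500])
```

The point is simply that the heavy crawling can be done once by a third party, and whoever trains a model then filters and curates those captures downstream, which is part of why the question of who did the scraping matters for the legal angle.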
Again, just for my own clarification: what do you think about this article?
I think the background research is excellent, and the facts mentioned mostly seem correct to me (except this quote: “Despite the potential risks, EAs broadly believe super-intelligent AI should be pursued at all costs.”). I am biased though since I was interviewed for that article.
What are your thoughts? Curious.
My honest question is: do you think this is actually a step to de-escalate the Ethics/Safety feud?
People in AI Ethics are very sharp at noticing power imbalances. One of the frustrations voiced by AI Ethics researchers is how much money is sloshing through and around the AI Safety community, yet AI Safety folks don’t seem to give a shit about preventing the harms that are already happening now.
I expect no-strings-attached funding for creatives will help de-escalate the conflicts somewhat. The two communities are never going to like each other, but an AI Safety funder can take the intense edge off a bit through that kind of support, and that will help the two communities not totally hamper each other’s efforts.
I think asking the ‘Safety’ side to make a unilateral de-escalatory step is unlikely to actually lead to much progress.
I mean, is it a sacrifice for the AI Safety community to help creatives restrict data laundering? If it is not a sacrifice, but actually also helps AI Safety’s cause, why not do it?
At the very least, it’s an attempt at reconciling differences in a constructive way (without demanding that AI Ethics folks engage in “reasonable” conversations, which I’ve seen people do a few times on Twitter). AI Ethics researchers can choose how to respond to that, but at least we have done the obvious things to make amends from our side.
Making the first move, and being willing to do things that help the cause of both communities, would reflect well on the AI Safety community (including in the eyes of the outside public, who are forming increasingly negative opinions on where longtermist initiatives have gotten us).
We need to swallow any pride and do what we can to make conversations a bit more constructive. We are facing increasing harms here, on a path to mass extinction.
Just read through your thoughts, and responded.
I appreciate your honesty here, and the way you stay willing to be open to new opinions, even when things are looking this pessimistic.