Open web crawling is such a security vulnerability that I don’t know why it isn’t at the top of the news every day.
If you turn on a general suction hose, how do you not realise there’s going to be a party of attackers right there feeding it all the #propaganda they possibly can?
How can you be so nonchalant about it? How do you not realise you created the biggest attack vector in the history of computing?
Daniel Friedlaender explains why AI #innovation depends on #data diversity – and how outdated #privacy approaches are a disadvantage in the global AI race.
The recent copyright decisions against AI have made it imperative to rethink how we train AI. Ultimately, we should aim to build a free training data repository to train genuinely free AI.
#AI #trainingdata
https://www.korte.co/3iqx
Many critics maintain that AI cannot be open sourced in principle (cited: Lessing, Casado, Stoica). To me it seems clear that all people investing in public uses for AI have a duty to demand legal clarity and open access to #trainingdata. @simonschlauri delivers a constructive and balanced exposé at #Winterkongress https://winterkongress.ch/2025/talks/open_source_artificial_intelligence/
The worst-case scenario here is you get sued?
Apparently, this author/text was not included in the Books3 dataset of pirated content used for LLM #trainingdata.
'The New York Times' takes OpenAI to court. ChatGPT's future could be on the line
A group of news organizations, led by The New York Times, took ChatGPT maker OpenAI to federal court on Tuesday in a hearing that could determine whether the tech company has to face the publishers in a high-profile copyright infringement trial.
#NYT #media #copyright #legal #ChatGPT #OpenAI #artificialintelligence #AI #LLM #data #TrainingData #technology #tech
https://www.npr.org/2025/01/14/nx-s1-5258952/new-york-times-openai-microsoft
We've made #Swedish language training data for development of #HTR models available for download, https://riksarkivet.se/psidata/traningsdata-for-htr-modeller
This data, together with data from other archives whose training data is not for us to publish, is the basis for our HTR-model Swedish Lion Libre, https://huggingface.co/collections/Riksarkivet/htrflow-v012-models-66fd07cfd45a5a9b690fdcac
If you do use the training data, the model or, even better, you have ground-truth data you'd like to share, just get in touch!
My latest article on unlawful training of AI models and what we should do about it just dropped on TheRegister today:
Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft
Harvard University is releasing a high-quality dataset of nearly 1 million #publicdomain books that could be used by anyone to train large language models and other AI tools. It contains books scanned as part of the #GoogleBooks project that are no longer protected by copyright
#Harvard #Microsoft #OpenAI #ArtificialIntelligence #AI #data #bigdata #trainingdata #technology #tech
https://www.wired.com/story/harvard-ai-training-dataset-openai-microsoft/
Training a Self-Driving Kart - There are certain tasks that humans perform every day that are notoriously difficu... - https://hackaday.com/2024/12/21/__trashed-11/ #convolutionalneuralnetwork #machinelearning #self-driving #trainingdata #autonomous #crazykart #training #go-kart
Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft. The project’s leader says that allowing everyone to access the collection of public-domain books will help “level the playing field” in the AI industry.
https://archive.is/DrzFn#selection-575.0-581.152
#AI #Harvard #trainingdata
Senate Bill Targets AI ‘Black Box’ Problem, Eyes Transparency in Use of Copyrighted Works
The Transparency and Responsibility for Artificial Intelligence Networks (TRAIN) Act was introduced in the Senate on Monday in the latest effort to shield songwriters, musicians and other creators from the unauthorized use of their works in training generative AI models.
#copyright #music #legal #TRAIN #ArtificialIntelligence #AI #data #bigdata #TrainingData #technology #tech
Dialogue from 53,000 movies and 85,000 TV episodes is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies.
It includes writing from every film nominated for Best Picture from 1950 to 2016 and at least 616 episodes of The Simpsons.
#OpenSubtitles #hollywood #TV #movies #copyright #ArtificialIntelligence #AI #LLM #TrainingData #data #bigdata #technology #tech
https://www.theatlantic.com/technology/archive/2024/11/opensubtitles-ai-data-set/680650/
@wook @kottke
I’m not
AI Companies Are Reportedly Struggling to Come Up With New and Better Products https://petapixel.com/2024/11/13/ai-companies-are-reportedly-struggling-to-come-up-with-new-and-better-products/ #trainingdata #aitraining #Editorial #anthropic #google #openai #News
My latest article "AI and the Fruit of the Poisonous Tree" is scheduled to be #published in the #technical #press tomorrow.
The article explores ways to motivate #AI companies and large #platforms that unlawfully seed AI #trainingdata from #usercontent to meet their #legal #obligations, so that in 20 years we do not find ourselves in the same place we are currently in with #privacy, #dataprotection and #cybersecurity.