No, training and inference are two separate processes. Training data is never redistributed, only obtained and analyzed. What matters is what data is put into context during inference. This is controlled by the user.
AI/ML is complex, so as a simpler analogy: If I watch The Simpsons, and I create an amusing infographic of how often Homer says "D'oh!" over time, my infographic would be an original work. AI training follows the same principle.
If you really believe that then we can't have a meaningful conversation about this, that's not even ELIF territory, that's just disconnected. You should be asking questions, not telling people how it works.
How exactly is it different? The model itself is just a probability distribution over the next token given the input, fitted to a giant corpus, i.e. a description of statistical properties. On its own it doesn't even "do" anything, but even if you wrap it in a text generator and feed it literal gcc source code fragments as input context, it will quickly diverge. Because it's not a copy of gcc. It doesn't contain a copy of gcc. It's a description of what language is common in code in general.
In fact we could make this concrete: use the model as the prediction stage in a compressor, and compress gcc with it. The residual is the extent to which it doesn't contain gcc.
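As a toy illustration of that residual measurement (all names and texts hypothetical; a character-bigram "model" stands in for the LLM, and ideal code length stands in for a real arithmetic coder):

```python
import math
from collections import Counter

def bigram_model(text):
    """Fit a toy character-bigram predictor: P(next char | current char)."""
    pairs = Counter(zip(text, text[1:]))
    ctx = Counter(text[:-1])
    # Add-one smoothing over an assumed 256-symbol alphabet.
    return lambda a, b: (pairs[(a, b)] + 1) / (ctx[a] + 256)

def residual_bits(model, text):
    """Ideal code length: the bits a compressor would still need on top of the model."""
    return sum(-math.log2(model(a, b)) for a, b in zip(text, text[1:]))

corpus = "the model describes language in general, not any one text " * 20
model = bigram_model(corpus)

# The residual is small for text the model encodes, large for text it doesn't.
seen = residual_bits(model, "the model describes language in general")
unseen = residual_bits(model, "call me ishmael, quoth the whale zealot")
print(seen < unseen)
```

The gap between the two residuals is exactly the sense in which the model "contains" one text and not the other.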
There have already been multiple documented cases of LLMs spitting out fairly large chunks of the input corpus. There have been experiments to get models to replicate the entirety of 'Moby Dick', with some success for one model but less for others, most likely due to output filtering meant to prevent the generation of such texts. But that doesn't mean the texts aren't in there in some form. And how could they not be? It's just a lossy compression mechanism, and the degree of loss is not really all that relevant to the discussion.
I see a test where one model managed to reproduce a paragraph with 85% accuracy, given 3 input paragraphs, less than 50% of the time.
So it can't even produce 1 paragraph given 3 as input, and it can't even get close half the time.
"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.
I'm no longer certain what point you're trying to make.
Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.
If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.
It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeatedly do that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters) then we can prove the statistical model encodes the book. The further we are from that (and the more we can generate with less) then the stronger the case is. It's a pretty strong case!
> If I feed them 50% of Moby Dick and they produce the next word and I can repeatedly do that to produce the entire book... then we can prove the statistical model encodes the book.
It can't because it doesn't. That's what it means to say it diverges.
The "number of attempts" is you cheating. You're giving it the book when you let it try again word by word until it gets the correct answer, and then claiming it produced the book. That's exactly the residual that I said characterizes the extent to which it doesn't know the book. Trivially, no matter how bad the model is, if you give it the residual, it can losslessly compress anything at all.
If you had a simple model that just predicts next word given current word (trained on word pair frequency across all English text, or even all text excluding Moby Dick), and then give it retries until it gets the current word right, it will also quickly produce the book. Because it was your retry policy that encoded the book, not the model. Without that policy, it will get it wrong within a few words, just like these models do.
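A toy version of that argument (hypothetical texts, and a deliberately terrible "model" that's just uniform over the vocabulary): with a retry-until-correct policy, any model at all reproduces the target, because the accept/reject decisions, not the model, carry the information.

```python
import random

target = "call me ishmael some years ago never mind how long precisely".split()
corpus = "the whale and the sea and the ship and the captain".split()
vocab = sorted(set(corpus) | set(target))

def predict(prev_word, rng):
    # A deliberately terrible "model": ignores context, uniform over vocab.
    return rng.choice(vocab)

rng = random.Random(0)

# Free-running generation: diverges from the target almost immediately.
free_run = [predict(w, rng) for w in target[:5]]

# Retry policy: resample until the model "gets it right".
reproduced, rejections = [], 0
for word in target:
    guess = predict(word, rng)
    while guess != word:
        rejections += 1  # every rejection leaks information about the book
        guess = predict(word, rng)
    reproduced.append(guess)

print(reproduced == target)  # True: but the retry policy encoded the book
```

Without the retry loop the output is noise within a word or two; the expected number of rejections per word is roughly what an arithmetic coder would charge as the residual.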
But it does encode it! Each subsequent token's probability space encodes the next word(s) of the book with a non-zero probability that is significantly higher than random noise.
If you had access to a model's top-p selection, then I'd bet the book is in there consistently for every token. Is it statistically significant? Might be!
I'm not cheating because the number of attempts is so low it's irrelevant.
If I were to take a copyrighted work and chunk it up into 1000 pieces and encrypt each piece with a unique key, and give you all the pieces and keys, would it still be the copyrighted work? What if I shave off the last bit of each key before I give them to you, so you have a 50% chance of guessing the correct key for each piece? What if I shave two bits? What if it's a million pieces? When does it become transformative or no longer infringing for me to distribute?
Consider a password consisting of random words, each chosen from a 4,000-word dictionary. Say you choose 10 words. Then your password has 10 × log2(4000) ≈ 120 bits of entropy.
Now consider a validator that tells you when you get a word right. Then you can guess one word at a time, and your password strength drops to log2(4000 × 10) ≈ 15 bits. Exponentially weaker.
You're constructing the second scenario and pretending it's the first.
Also, in your 50% probability scenario, each piece's key is missing only 1 bit, and even 50-100 unknown bits is unguessable. A 1000-piece scheme contributing 1 unknown bit each would be absurdly strong.
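The numbers in that comparison can be checked directly (assuming a 4,000-word dictionary and 10 words, as above):

```python
import math

dict_size, n_words = 4000, 10

# Guess-all-at-once: entropy adds across words.
whole = n_words * math.log2(dict_size)   # ≈ 119.7 bits

# Per-word validator: a wrong word is detected immediately, so the
# attacker searches each word independently rather than jointly.
oracle = math.log2(dict_size * n_words)  # ≈ 15.3 bits of effective strength

# The 1000-pieces scenario with 1 shaved bit each, when pieces are NOT
# independently verifiable: the unknown bits multiply, not add up in guesses.
shaved = 1000 * 1.0                      # 1000 bits: unguessable

print(round(whole, 1), round(oracle, 1), shaved)
```

The validator turns a product of search spaces into a sum, which is why it's "exponentially weaker".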
You're still missing the point. The numbers don't matter because it's copyright infringement as long as I can get the book out. As long as I know the key, or the seed, I can get the book out. In court, how would you prove it's not infringement?
Because you put the book in. Again, this is measurable. Compress the book with a model as the predictor. The residual is you having to give it the answer. It's literally you telling it the book.
The point is that the AI companies themselves and their backers are on record saying that the AI could reproduce copyrighted works in their entirety, but that there are countermeasures in place to stop it from doing so.
I wonder what the results would be if I spent the time to train a model from scratch without any such constraints. I'm much too busy with other stuff right now, but that would be an interesting challenge.
Yeah just like a star could appear inside of Earth from quantum pair production at any given moment. But realistically, it can't. And you can't even show a test where any model can get more than a few tokens in a row correct.
These companies just don't want to deal with people complaining that it reproduces something when they don't understand that they're literally giving it the answer.
Is the claim that these models can 1 shot a Simpsons episode remake with different camera angle and similar dialog from a prompt like "produce Simpsons episode S01E04"? Or are we falling into the "the user doesn't notice that they told the model the answer, and the model in fact did not memorize the thing" trap?
For the same price, Walmart is selling the HP Omnibook 5. It has a better CPU, 16 GB RAM (double), and 1 TB SSD (double).
There's also another HP for $359 with 8 GB RAM & 1 TB SSD. For half the price of this MacBook Neo, it should offer comparable performance with double the storage.
> For anyone wondering how this happened, a bit lower in the comments OP says that this key was UPLOADED TO GITHUB, I mean, that should really be the end of the thread. It sucks that this happened, sucks even more that you have a personal card on file, but the fact that Google is even saying ‘shared responsibility’ after you uploaded it to GH is crazy, it is your responsibility.
I've said it before, and I'll say it again: The standard should be that devices ask whether the user is a minor during setup, and make that available as an is_minor boolean to all apps and websites. Children's devices are almost always set up by parents, and the setting can be protected by a parental PIN code. This method is effective while being completely private and local.
Though I can't take credit for the idea. It was proposed by the European Democratic Party.[0]
It's only effective if "Children's devices are almost always set up by parents", which is a big assumption. My parents were about as tech savvy as you could reasonably expect but I still got away with buying R-rated video games and such. Kids are persistent and the dangers aren't always obvious.
If kids are being persistent and the parent is indifferent to it, then online age verification won't be effective either. Children will just ask mom and dad to verify their Roblox and Discord accounts.
For sure, I'm not blanket supporting age verification technology. Just saying the alternative proposed by the parent commenter isn't very reliable either.
Both. The same as for other materials we don't want kids to access, like alcohol. We can't expect parents to always be watching their kids. That's not how societies have ever worked.
But what I'm actually questioning in my comment above is effectiveness of the technology solution proposed at the device level.
It's effective insofar as the parents secure the device. If it's a general purpose computer, and the parent forgets to lock the bios, kids will just live boot into Ubuntu or some other OS and do as they please.
Or they may install keyloggers (including hardware loggers) to get the parents' password and then go update their account.
Certainly this may help hinder them, but it won't take long for them to learn the basics of circumvention, and the cost is regulated speech for OS manufacturers.
I did some experimenting with different search engines and AIs. Here are the results:
Google and Brave linked to the official GitHub repo followed by the fake domain. DuckDuckGo and Bing linked to the fake domain first, followed by the official GitHub. Mojeek gave higher ranking to two third party articles, but linked to both the official GitHub and website without fakes. Qwant was the worst, as the official website was the second result amongst multiple fake websites and an unrelated GitHub repo.
Then there are the AIs. ChatGPT, Google AI mode, Gemini, Grok, Perplexity, and Brave Search "Ask" all linked to the official website, and some added the GitHub repo as well. DuckDuckGo Search Assist linked to just the official GitHub. Google AI mode, Gemini and Grok also explicitly warned about the fake websites. Copilot got the official website and GitHub right, but also linked to a presumably fake X account.
Conclusion: Google, Brave and Mojeek win in search. AI is very good and clearly beats search overall. Google AI mode, Gemini and Grok stand out in quality.
For you... But the results are different for different users.
For me, Google shows the .net site first and the GitHub one second.
Asking ChatGPT 5.2 (Auto mode) to search for the nanoclaw site, it says the same: it links the .net site first and shows the GitHub as an optional page.
When I try to give it a hint by asking "are you sure?", it still hallucinates that the site is linked from the GitHub:
"Yes — nanoclaw.net is the official documentation/site for the NanoClaw project, in the sense that it’s the project’s published homepage and is directly linked from its canonical open-source repository. It describes the project, features, installation steps, and links to the source code on GitHub, which is the authoritative source for the project’s codebase."
ChatGPT 5.2 (Thinking mode) and Claude get it right on the first try: they answer with the official .dev page first, and Claude shows the .net site second as "another site covering the project".
I tried AltPower Search and it exhibits the same issue as Google.
I think you might just need to give it more time to index; nanoclaw.dev has only been available for a week. Then there's the lower relative reputation of the 'dev' vs. the 'net' domain ...
It depends on how the payment app works. Android provides a native Contactless Payments API which can be used by any wallet app. This is local to the device and works flawlessly on GrapheneOS as well. You can set your preferred wallet app for this feature under NFC settings.
Google Pay/Wallet is one of the wallet apps using this API. If you use Google Pay, you set it as your preferred wallet app, and Google will act as an intermediary between you and whatever payment method you've configured in Google Wallet. It's this Google Pay app that's broken.
Banking, payment and wallet apps that implement the Contactless Payments API work normally, as they should. But some banks have lazy developers who just hyperlink you to add your card to Google Wallet instead.
You're presenting a false dichotomy. I'd argue that installing and setting up GrapheneOS on a Pixel takes as much effort as setting up an iPhone, or less. And it gives you full freedom and the best possible security. You can have everything at once.
If your relatives are significantly tech illiterate, I'd skip the smartphone entirely and go for a locked-down Linux desktop + feature phone. The most dangerous apps are big legitimate ones.
If you do go for a smartphone, my experience tells me that there's no difference between Android and iOS. The biggest sources for shady apps are the Google Play Store and Apple App Store. Shady stuff on the web can be easily defeated using an adblocking browser, which is essential for older relatives.
> If your relatives are significantly tech illiterate, I'd skip the smartphone entirely and go for a locked-down Linux desktop + feature phone. The most dangerous apps are big legitimate ones.
You know, they are adults and have free will and do want a smartphone like everyone else to use Whatsapp, read the news, search things on Google, etc.
Hell, my 95 year old grandma convinced a nurse to install TikTok on her phone because she saw her using it and also wanted to try it. It's not like we can isolate them from the world.
Sir no sir. I believe entirely the opposite. If they're tech illiterate then they don't have the entrenched knowledge that is the only thing keeping most people within the Windows ecosystem.
A Linux install that meets the basic needs of the user is perfecto!
Less so recently just due to time constraints, but I'm generally the technical person in my family group, and I've lost enough touch with Windows that troubleshooting it is increasingly difficult. If they need me to 'format and reinstall' they're getting Linux unless they have a very specific need that only Windows can cater to.
It's getting less silly every month! So many people in that boat only use the web browser anyway.
With a well-supported hardware configuration and a working web browser, even a non-techie may have a more stable experience than they would with Windows.
That has as much to do with the decline of Windows as with the ascent of desktop Linux, but still.
I don't believe you need an internet connection - IIRC the jailbreaking steps were: plug in the Kindle, drop the jailbreak folder into the root directory, then choose `Update` from the Settings screen.
The hardest part was finding the `Update` menu item. It's only visible if you go to Settings, then press the menu button again while on the Settings page.
Depends on the firmware of the device. The latest firmware (anything after version 5.18.5.01, which was released in October 2025) is currently not jailbreakable.
Jailbreak of any firmware after version 5.16.2.1.1 (June 2023) requires the Kindle to be connected and registered.
Anything prior to, and including, this version can be jailbroken with no registration.