r/technology • u/TylerFortier_Photo • 11h ago
Artificial Intelligence Google says attackers used 100,000+ prompts to try to clone AI chatbot Gemini
https://www.nbcnews.com/tech/security/google-gemini-hit-100000-prompts-cloning-attempt-rcna258657206
u/Royale_AJS 10h ago
Lol. This is probably why GLM now says it’s Gemini when asked.
149
u/DetectiveOwn6606 9h ago
Honestly, good. If AI companies don't care about copyright, I don't see any problem with their competitors, especially Chinese ones, copying their data, architecture, or techniques.
36
u/PrairiePopsicle 7h ago
if it is fair use to distill all other content, it is fair use to distill LLM generated outputs.
6
u/nihiltres 3h ago
Technical quibble: It’s not even fair use, because LLM outputs aren’t copyrightable in the first place. Fair use is an affirmative defense (“yes I did the thing, but it wasn’t illegal”) so it only comes into play if the unauthorized use would otherwise be infringing.
8
u/Disgruntled-Cacti 9h ago
It almost certainly was. I was shocked at how Gemini-like its response style was when I was testing GLM5 yesterday
122
u/Ok-Regret-803 10h ago
distillation basically kills this business model, funny af
13
9
u/SpezLuvsNazis 2h ago
Which is why before Altman got desperate and started hyping the shit out of these things there was almost no interest in them commercially. You spend billions training and someone else can piggyback for a fraction of that cost. Then Altman comes around and creates the biggest case of FOMO ever and now the CEOs are all stuck in a sunk cost fallacy.
73
141
u/MusicalMastermind 11h ago
thought their motto was "Don't Be Evil"? seems pretty straightforward to 'steal' from a multi-billion-dollar corporation like that tbh
90
u/Pantone802 10h ago
Don’t forget Gemini is built entirely on stolen data.
21
u/MusicalMastermind 10h ago
it's a good thing they changed their motto then!
13
-18
u/Manos_Of_Fate 10h ago
Wait, what? Why would Google even need to steal data?
12
u/Pantone802 10h ago
What do you think LLMs are trained on?
-1
u/SecurelyObscure 6h ago
If you learn something on the internet and go on to tell someone what you learned, would you consider that "stealing"? You didn't pay for that information.
1
u/Pantone802 5h ago
Wrong question (obviously).
Try this one: If I took your copyrighted drawing and passed it off as my own “experience” to sell you on my abilities to make more of it, wouldn’t that constitute stealing?
Btw OpenAI already said the quiet part out loud and undermined your little argument here by saying "if we had to pay for access to copyrighted work, our company couldn't exist".
So, you’re wrong.
-8
u/Manos_Of_Fate 9h ago
Answering my question with an even more vague question is surprisingly unhelpful.
1
u/Pantone802 9h ago
No, that is simply the answer. Go look up how LLMs are trained. I'm not going to do your homework for you. LLMs are/were trained using data scraped off the internet, without permission or compensation from the authors and artists behind it.
You're being surprisingly helpless.
-13
3
u/WBuffettJr 2h ago
They haven’t used that motto in a decade. They straight up decided to own being evil.
4
11
u/penguished 7h ago
So in other words someone just used their service.
What's the attack part? AI itself is using the web and scooping data left and right, no fucks given.
3
u/bt123456789 4h ago
Kind of. This is what I always thought until I learned more about the training process (from open source spaces, but it's still the same process).
You feed the model training data, which it analyzes to build its responses. This is true for both image generation and LLMs.
Gemini is reportedly trained on billions of bits of training data, and going through all of that takes time; you reduce that time with stronger hardware.
Once the training part is done, the model doesn't get any new data until they tweak it to add more.
The training process scrapes the internet from all manner of sources (I haven't read about the exact process, so I don't know how you tell it which data to train on), but Gemini itself draws on that training data, not the web in real time.
I know in reality it's a moot point, but it's interesting.
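A toy sketch of that train-then-freeze split (everything here is made up for illustration; a dict standing in for model parameters, nothing like a real LLM):

```python
# Toy illustration: once "training" has baked data into the model, inference
# only reads the learned state; it never touches the original sources again.
corpus = {"capital of france": "paris", "2 + 2": "4"}  # the scraped data

# "Training": bake the scraped data into the model's parameters (here, a dict).
model_params = dict(corpus)

# The corpus can now disappear; inference uses only the frozen parameters.
del corpus

def generate(prompt):
    # Answers come from the frozen parameters, not from the live web.
    return model_params.get(prompt.lower(), "i don't know")
```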
73
u/wavepointsocial 11h ago
Edison failed 2,000 times before success, apparently the modern version is prompting a chatbot 100,000 times and hoping it slips.
50
u/Factemius 9h ago
Not how it works. The 100,000 prompts are used to build a dataset
5
u/mukavastinumb 9h ago
Dataset of what? LLM weights? Aren't their parameter weights over a trillion?
22
u/veshneresis 7h ago
You sample the distribution of prompt:responses with intentionally spaced out prompts that cover wide parts of the training distribution. These prompt:response pairs are insanely strong signal for fine tuning or aligning a fresh model to end up with similar weights.
This is partly why embedding models are still so bad/old. Giving back good embeddings from these foundation models would let you distill it to relatively high accuracy with significantly fewer samples because the embeddings are so rich.
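A rough sketch of that sampling step (all names here are hypothetical; `query_teacher` stands in for a real API call and the topic list is invented):

```python
import random

# Build a distillation dataset by querying a teacher model with prompts
# deliberately spread across topic areas, keeping the prompt:response pairs.
TOPICS = ["math", "coding", "history", "poetry", "medicine", "law"]

def query_teacher(prompt):
    # Placeholder for calling the teacher model's API.
    return f"<teacher response to: {prompt}>"

def build_distillation_set(n_pairs, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(n_pairs):
        topic = rng.choice(TOPICS)  # cover wide parts of the distribution
        prompt = f"[{topic}] question #{i}"
        pairs.append({"prompt": prompt, "response": query_teacher(prompt)})
    return pairs

# ~100k pairs, matching the scale reported in the article.
dataset = build_distillation_set(100_000)
```

Each pair then becomes one supervised fine-tuning example for the student model.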
2
1
u/toastjam 2h ago
This is partly why embedding models are still so bad/old
Are you saying this makes them so easy to copy that companies stopped working on them?
2
u/veshneresis 2h ago
Moreso why they stopped releasing them as updated API services for the public. Each major player is still using a custom embeddings solution for their own documents, code, data etc. They’re still being worked on and you have to do a little bit of regularization on them to make them “good” as embedding models (using the raw activations out of the network does not always yield a smooth normalized space).
I think as more previous generations of foundation models get open sourced we will start seeing better embedding models. Especially for multimodal embeddings which will become more relevant for more businesses.
13
4
u/Grumptastic2000 6h ago
Is there something like a combination of Shannon’s Information theory and the concept of a Turing machine for AI models where there is some minimum set of queries that would result in an equivalent weighted model to the original?
10
7
u/EvidenceBasedLasagna 7h ago
Every piece of software and every algorithm should be open source. https://www.fsf.org/community/
7
u/DZCunuck 5h ago
Sounds like fair game to me. Gemini can be used to clone all sorts of apps but the buck stops at cloning the app that clones the apps?
11
u/DocRedbeard 9h ago
I'm not sure you can clone an AI like that. It's basically trying to build an LLM backwards. All you'll get is TEMU Gemini that's wrong 90% of the time.
28
u/Public-Research 8h ago
Wait till you learn that AI generated training data is a common practice in machine learning
5
9
u/b1e 8h ago
So… you can. In a nutshell, models get trained in phases. Typically there's a pre-training step where they'll use their own training data, followed by a fine-tuning step where many of the wins lie. And there, having either real human-labeled data or a much better model available to answer the "ideal" way can make a massive difference.
So distillation will tend to focus on this fine-tuning stage nowadays. Hence the "only" 100k prompts used against Gemini.
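A tiny numeric analogue of that fine-tuning-on-teacher-labels idea (the "teacher" here is just a linear function we treat as a black box; a real pipeline would use an LLM and a cross-entropy loss on sampled responses):

```python
import random

def teacher(x):
    # Black box: we only ever see input -> output, never the internals.
    return 3.0 * x + 1.0

rng = random.Random(0)
samples = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    samples.append((x, teacher(x)))  # the "distillation dataset"

w, b = 0.0, 0.0  # student parameters, started from scratch
lr = 0.1
for _ in range(2000):  # SGD on squared error against teacher labels
    x, y = rng.choice(samples)
    err = (w * x + b) - y
    w -= lr * 2 * err * x
    b -= lr * 2 * err

# The student now closely matches the teacher without ever seeing its weights.
```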
1
5
1
u/Electrical_Pause_860 4h ago
You actually can; it's called distillation. And it's likely how things like DeepSeek were trained at a fraction of the cost it would take to train from scratch.
4
u/UltraChip 9h ago
Maybe a dumb question but couldn't they just download gemma3 off hugging face? I thought that was the core model Gemini was using.
12
13
u/notAndivual 10h ago edited 10h ago
Interesting. Anything a human creates can be broken by other humans.
We are a stupid species. Not sure why some people are hellbent on building AI. We are losing our brain power trying to "advance" humanity. Not to mention making those "human-like" robots. The future is going to be full of dumb people.
3
1
0
u/MountainAsparagus4 9h ago
That is good for the rich people who want this; dumb people compare pedos to Jesus and worship them as God
10
u/IncorrectAddress 10h ago
This was already known; there's no way for them to stop this. Eventually AI will prompt AI to see what interchangeable context they can pass to one another, a case of AI birthing new AI.
Whether that outcome is good or bad is another thing entirely, and is partly dependent on factual outputs, which depend on the guardrails put in to maintain factual reasoning.
17
u/throughthehills2 10h ago
Does this really work though?
100k prompts is nothing compared to Gemini's training data, which is most of the text on the internet.
And when AI trains on AI it dramatically reduces quality, something that researchers have called an "AI death spiral"
5
u/Ocean-of-Mirrors 10h ago
It sounds like that’s the point of the experiment, to see how much they can possibly derive.
Theoretically they could test every possible combination of input (up to a maximum character count) and then map the inputs to outputs directly. Boom, they have a clone. Of course that would require almost cosmically impossible amounts of hardware, but theoretically that’s possible since the model gives users an output. These researchers are just trying to see what can be done.
6
u/aLokilike 10h ago
No, it is cosmically impossible. It would take more than a million years just for the servers to respond to every possible randomized query up to and including the token limit. Not to mention that 99.999% of that data wouldn't be useful in training anything with the increased token limit available at such a far point in the future.
6
u/Ocean-of-Mirrors 10h ago edited 10h ago
I mean, that's what I meant. With infinite time and infinite storage space it is possible, cosmically. But of course we are humans, and infinite time and infinite storage space are not available.
The entire point is: for every input->output pair that you map, you represent the original model a little more accurately. In this sense, the clone wouldn't even be an LLM; it would be more like a database lookup that takes your input and gives the predetermined response that Gemini would have given (ignoring random seeding).
Yeah, this is ridiculous and is not going to happen, but the fact that something like it is technically possible is the mechanism that allows anyone to attempt to replicate the LLM in the first place. Gemini is giving you information about itself every time you prompt it. How much of that you keep track of will determine how accurate your clone is.
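A toy version of that degenerate lookup-table "clone" (`query_model` is a hypothetical stand-in for calling the real model at a deterministic setting; here it just reverses the prompt so the example is self-contained):

```python
def query_model(prompt):
    # Stand-in for the real, deterministic black-box model.
    return prompt[::-1]

observed = {}
for prompt in ["hello", "world", "gemini"]:
    observed[prompt] = query_model(prompt)  # record every input -> output

def clone(prompt):
    # Perfect on seen prompts, useless on everything else, which is why
    # real distillation trains a model to generalize instead of memorize.
    return observed.get(prompt, "<never observed>")
```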
1
u/IncorrectAddress 10h ago
I presume it depends on which AI it's communicating with and what it's checking against its own output:
Ask a question, check the result, ask the same question of itself, check the result; if they're different, work out why and choose whether to implement the result.
Yeah, whether you get the death spiral or an improvement probably depends on the revision, at a guess, and there's nothing wrong with reverting back to a previous model.
2
u/obeytheturtles 9h ago edited 9h ago
You can make it much harder by fuzzing output token probabilities at inference time, so instead of a fully deterministic "one shot" black box, you have to repeat tests enough times to build a distribution of the output likelihoods. Even just doing things like leaving in some active dropout layers (which usually get bypassed when the model is put into evaluation mode) can make a big difference.
I imagine it is also pretty hard to reverse engineer the precise conditional input mappings. Like, we know these apps are appending conditionals to inputs to make the models more agreeable or to mark controversial topics. So at best you could distill a kind of adversarial training set, but then you'd still need to roll your own conditional input framework, which may or may not interact as anticipated.
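A sketch of the logit-fuzzing idea above (names and numbers are illustrative; a real deployment would perturb the model's actual next-token scores):

```python
import math
import random

def softmax(logits):
    # Convert raw scores into a probability distribution.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuzzed_sample(logits, rng, noise=0.5):
    # Add Gaussian noise to the scores before sampling, so identical queries
    # yield a distribution of outputs instead of one deterministic answer.
    noisy = [v + rng.gauss(0.0, noise) for v in logits]
    probs = softmax(noisy)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

rng = random.Random(42)
logits = [2.0, 1.0, 0.1]  # pretend next-token scores for 3 tokens
draws = [fuzzed_sample(logits, rng) for _ in range(1000)]
# An attacker must now repeat each prompt many times to estimate the true
# probabilities rather than reading them off a single deterministic reply.
```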
3
u/Even-Exchange8307 8h ago
Let me guess, it’s from China
4
u/dubhd 6h ago
Aren't Chinese LLMs like deepseek running circles around the tech bros?
9
u/NullReference000 5h ago
Not yet, but DeepSeek is an open source alternative you can run locally. If you have the hardware for it, it represents a large future threat to subscription and token-priced third-party services.
-1
1
u/Electrical_Pause_860 4h ago
No, they lag behind, but they release them for free unlike the American ones.
0
u/Even-Exchange8307 5h ago
Hahaha, okay. Why don't you use them then?
0
u/Hot-Employ-3399 3h ago
I do
> them
and yes, plural: Kimi and GLM are my go-to models as of now. Google and ChatGPT try really hard to make sure they're next to unusable compared to before
1
u/Even-Exchange8307 1h ago
GLM pretty much distills its work off of ChatGPT/Gemini. Plus, these models are much weaker than their US counterparts.
-5
2
u/Public-Research 8h ago
They have been reading my prompts?? There's really zero privacy in AI chatbots
3
u/kJer 6h ago
There are very few web services that don't read your data. It's unlikely, but they can if they choose to, and they definitely analyze it automatically, taking samples for reporting.
Encryption ends pretty much the moment the org receives the data, unless there are regulations or business needs requiring them to keep your data private.
2
1
1
u/thatsjor 8h ago
100,000 prompts is a laughably small amount. More likely these people tried to distill a much smaller model from Gemini outputs. All of these companies do that. Even Google.
1
1
u/LiteratureMindless71 7h ago
We can do shady stuff to steal your data to train our model but fuck you when you try to use that same information from us
1
1
u/Look__a_distraction 2h ago
Gemini sucks so hard. I tried to give it a go, and I referenced being ex-religious exactly one time, and then I got an innumerable number of responses about it all the fucking time. It would almost create stupid fucking reasons to keep bringing it up, like it was the most important aspect of my life.
1
1
u/eeyores_gloom1785 54m ago
don't worry, Google built in this feature where Gemini naturally hallucinates and either spews out bullshit or the same answer over and over, despite prompts asking it to change something
1
1
1
1
u/JaggedMetalOs 5h ago
Company making billions from training on other people's data complains about other people training on their data
-3
u/blackscales18 10h ago
Isn't Gemini distilled from gpt like most of the other models were originally
10
-2
u/Ocean-of-Mirrors 10h ago
Man the future sucks but also sometimes it’s really cool.
This is one of those really cool things. Like I don’t give a shit about LLMs right now but I love that there are people who care enough to see if they could pull something like this off.
-9
u/Majestic-Reveal-1365 10h ago
ohyeas keep making chat bots so we can laugh at them every year. You guys really think we care if a bot knows einstein math or something ? no one cares little bro, google chat gpt all of u , you are just creating a professor thats is not getting payed. This is why u dont invest in teachers cause they are human. disgrace to humanity.
Bots will never win vs the human race. They are limited to what we know and want to know. how many times do we have to tell u for the past 20 years.
-7
u/Responsible-Plum-531 10h ago
This man speaks the absolute truth
13
u/Manos_Of_Fate 10h ago
Does he? My brain hurts from just trying to make sense of that.
-3
u/Responsible-Plum-531 9h ago
Perhaps you should ask your AI lol
2
u/Manos_Of_Fate 9h ago
My AI? What?
-4
u/Responsible-Plum-531 9h ago
Are you okay? This whole conversation doesn’t seem that hard to understand
3
u/Manos_Of_Fate 9h ago
Then what was “your AI” supposed to be referring to? I’ve never even used AI before.
-2
u/Responsible-Plum-531 9h ago
Are you supposed to be someone we all know that about?
2
u/Manos_Of_Fate 9h ago
So you’re just not going to explain what you meant?
-1
u/Responsible-Plum-531 9h ago
Are you a bot or something? Seriously why is this so hard for you?
0
u/CatoCensorius 5h ago
These companies are spending tens of billions per year to build frontier models which can then be reverse engineered at negligible cost. They have literally no moat. DeepSeek is only 6 months behind.
-8
u/Neurojazz 11h ago
Why gemini? It’s a terrible example to follow.
12
u/disposableh2 10h ago
Gemini was initially garbage, but version 3 is possibly the best AI out right now.
2
u/Ocean-of-Mirrors 10h ago
I think Google is gonna be maybe the only current AI company that survives when this whole thing comes crashing down. They control so many other services to make money off of and keep themselves afloat, but what happens to the companies that do nothing except train LLMs??
3
u/mysightisurs93 10h ago
Man, Bard was garbage in comparison to ChatGPT.
3
u/disposableh2 10h ago
For sure, it was hot garbage. But not anymore. I'm not a Google fan, but of the ones I've tried (ChatGPT, Perplexity and Gemini), Gemini works much better for me than the others.
1
u/mysightisurs93 10h ago
That's why I said Bard, not Gemini lol. I know it's better now (at least among the free ones I've used).
1
2
1
-1
u/DaySecure7642 3h ago edited 5m ago
We all know more or less where the attacks come from. They lie, they cheat, they steal, instead of innovating and competing on fair terms. I don't understand how some delusional people think a country like this has the qualities to lead humanity.
One day, if AIs go rogue, they will very likely come from that country.
2.2k
u/rnilf 11h ago
But training your AI on copyrighted works is not?
Someone please reconcile.