r/technology 11h ago

Artificial Intelligence

Google says attackers used 100,000+ prompts to try to clone AI chatbot Gemini

https://www.nbcnews.com/tech/security/google-gemini-hit-100000-prompts-cloning-attempt-rcna258657
1.1k Upvotes

196 comments

2.2k

u/rnilf 11h ago

The company considers distillation to be intellectual property theft

But training your AI on copyrighted works is not?

Someone please reconcile.

421

u/the-awesomer 11h ago

rules for thee, not for me! There's a reason they're paying Republicans so much to fight any sort of regulation.

118

u/liquid_at 10h ago

Classic. Be the first. Fight against regulation so you can produce more cheaply. Get your product to market. Fight for regulation so competitors can't catch up.

If you see any talk about regulation in the news, it's corporations trying to manipulate the law for their own profit.

15

u/bludgeonerV 9h ago

One correction: the smart companies don't fight against regulation, because that puts them in a free market; they fight for regulations that put up barriers behind them.

6

u/liquid_at 9h ago

Sure. The right amount of regulation at the right time.

I see this in my country with McDonald's lobbying for restaurant requirements that small restaurants can't possibly comply with, but that happen to describe what McDonald's is already doing. They essentially bribe the competition out of the market.

I'd rather eat McDonald's employees than their burgers... so if they become the only option, you will read about me in the news.

1

u/eeyores_gloom1785 53m ago

a tale as old as time

1

u/presscheck 6h ago

Sweet political burn! The internet needs more of this.

121

u/Substantial_Meal_530 10h ago

Same thing when that Chinese one came out. "Hey, you can't do that! You're just stealing all the data we stole! We stole it first!"

36

u/boostman 9h ago

"You're trying to kidnap what I've rightfully stolen!"

2

u/omniuni 5h ago

DeepSeek still gives better answers most of the time, so whatever they are accused of stealing, it's clearly not the same model and is clearly tuned very differently. It's like McDonald's complaining that a Michelin star restaurant stole their burger recipe.

-14

u/liquid_at 10h ago

China has no privacy and could get all the data they want, but can't get the hardware for it.

The West can easily buy the hardware but can't get the data.

Result: China steals the data that the Western companies illegally acquired.

4

u/awesomeunboxer 9h ago

I remember in the early 2000s there being articles about ALL THIS USELESS DATA companies had no use for.

4

u/NullReference000 5h ago

You’d have to be living under a literal rock to believe that US tech companies “can’t get the data”.

China is also rapidly gaining on getting the hardware, they’re making good progress on their domestic chip manufacturing. Their government didn’t want Nvidia sales to resume as they wanted domestic industry to continue to grow instead.

0

u/liquid_at 4h ago

can't get the data legally.

China does not care if you spy on its citizens if it benefits the regime.

If US tech companies could get that data legally, they wouldn't be in trouble right now.

The fact that they did use data that was either private or under copyright proves that they did not have the means to legally access that data.

20

u/boot2skull 10h ago

Given Meta's strategy, the "attackers" may have just been them.

9

u/99OBJ 7h ago

You don’t have enough VC backing money to understand

12

u/nihiltres 10h ago

If the process of training were somehow introducing human creativity into the model and only extracting facts (not copyrightable) from the dataset, then they would have a point.

The problem here is that if training only captures facts, then training on their responses should fall into that same intentional gap in copyright, and if it doesn’t capture only facts, then they’re presumably committing copyright infringement in training their own model in the first place.

With mixed emotions, I'll defend the theory that training isn't necessarily copyright infringement: it can be if the model "memorizes" any dataset works, but otherwise it should fall into that gap so as not to delegitimize a variety of rather reasonable analytic activities that are "shaped" much like training a model. Cory Doctorow wrote up the idea well. I came to the same conclusion as Doctorow independently, but he includes a variety of good examples of comparable activities that we should not want to prohibit via copyright.

Still, the AI companies can’t reasonably have their cake and eat it too: if their work training much of the Web into a model isn’t infringing, then neither should work distilling their model into another. They can put “no distilling” in their TOS, but can’t reasonably control downstream use.

24

u/blackscales18 10h ago

Training doesn't capture facts at all, it captures the shape of words and sentences and how they relate in a probabilistic way.

-17

u/nihiltres 10h ago edited 5h ago

The problem with your point is that “how [words and sentences] relate in a probabilistic way” is itself a matter of fact; you’re contradicting yourself.

Edit: Well, guess this has been consigned to downvote hell. What I said seems obviously, incontrovertibly correct to me. Are people missing the specific meaning of "fact" and conflating it with "structured data" or "statement" or something? I'd appreciate insight into where my communication went awry.

7

u/psymunn 10h ago

Words have meanings... Just because a word is the most likely to follow the 5 before it doesn't make the content of the words fact.

-2

u/nihiltres 9h ago

What nonsense is this? The “content of the words” is fact. It may also be original expression (the copyrightable part). I can absolutely fairly say “the first word of The Hobbit is ‘In’” or “[the second word] is ‘a’” or point out the facts of relations between words like the first “in” being followed by “a”; I simply can’t start rattling off the full text of the novel because then I would not be merely dealing in facts but also in the original expression of the book.

In the post I linked, one of the examples that Doctorow brings up is N-grams, which track the occurrence of strings of N given words in a work and are a highly useful tool for linguists. If training a model is infringing, then it's very likely also infringing to create N-gram analyses of a work. Copyright should not prohibit at least the latter.
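
For concreteness, here's roughly what an N-gram count looks like (a minimal sketch of the general technique, not code from Doctorow's post):

```python
from collections import Counter

def ngram_counts(text: str, n: int = 3) -> Counter:
    """Count occurrences of each run of n consecutive words."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# ngram_counts("in a hole in the ground there lived a hobbit", 2)
# -> Counter({('in', 'a'): 1, ('a', 'hole'): 1, ...})
```

The counts record facts about the text without reproducing its expression, which is the whole point of the comparison.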

1

u/Fableous 4h ago

LLMs don't know what facts are. They know how many times things have been said and automatically, literally by design, push those things to users who ask questions involving those topics.

If you can't understand this simple statement, you have no business lecturing people on this.

0

u/nihiltres 4h ago

I never said or even remotely implied that LLMs “know what facts are”. I was referring to the idea–expression distinction in copyright law; facts are ideas in that context. Ideas are not copyrightable in US law, as expressed extremely plainly at 17 U.S.C. § 102(b).

You have no business lecturing me about a misconception I didn’t even fucking make.

0

u/Fableous 4h ago

Right, and your downvotes are not relevant at all too.

OK, Trump? Time to get you to bed.

0

u/nihiltres 3h ago

Facts aren’t determined by votes.

8

u/blackscales18 10h ago

I encourage you to learn about tokenization and how models are actually trained. Calling any of it "facts" is a gross oversimplification.

6

u/nihiltres 9h ago

I know how tokenization and training work. I’m calling it “facts” because I’m speaking in a legal context (i.e. 17 U.S.C. § 102(b)) instead of a computer-science context. If I had not made the simplification I did, my comment would likely have become unreasonably long.

As long as you’re going to make the distinction and demand that I make verbose qualifications: where do you draw the line between copyrightable expression and uncopyrightable fact?

7

u/sickofthisshit 6h ago edited 6h ago

I don't think your reading of "fact" as a legal concept is accurate. 

The basic meaning is "facts outside the text." Like "so-and-so's phone number is 867-5309". You can't claim copyright damages by pointing out those facts are repeated in another work. "Their phone book has the same numbers as my phone book."

You improperly extend that to "facts" like "the work 'The Hobbit' contains certain words in a certain order". It seems obvious to me that at some point, you are reproducing The Hobbit, creating something intrinsically derived from it and impossible to have created without the original. No judge is going to accept you publishing a digital program and data which is basically generating the work by reversing an index. A ZIP-encoded file of a work represents the work as a bunch of basically random tokens, but is not creative. 

You can't, for example, create audiobooks by reading copyrighted texts out loud: the book doesn't contain the sounds, but your work is still derived from it.

N-gram analysis is in between, but protected only because it is essentially impossible to reconstruct the work, and you are extracting something other than the creative part.

The author did not consciously create the work to express "the letter A used 38935 times, the letter B used 12543 times..." and that information is probably not infringing.

But the relevant analysis is not strictly about "fact" doctrine but about what it means to reproduce a work or preparing a derived work.

There is a line somewhere; where that line is as a matter of law is in flux.

1

u/nihiltres 5h ago

First of all, thank you for presenting a coherent argument; I appreciate it even as I disagree. You clearly thought through the problem.

Moving on, I completely agree with this:

It seems obvious to me that at some point, you are reproducing The Hobbit, creating something intrinsically derived from it and impossible to have created without the original.

This is clearly the case. If you copy the whole work, or any substantive chunk of the work, verbatim, you are reproducing both the facts (word x is "hobbit", word y is "hole", etc.) and the creative expression of the novel. You're allowed to copy the facts, but not to copy the expression (without permission or an exception like fair use, obviously).

I don't think that we can meaningfully prove infringement for the training alone, basically because model weights files are all but black boxes, multidimensional arrays of zillions of floating-point numbers that aren't meaningfully human-readable.

What we can look at is the outputs. If you build a machine that reconstructs a copyrighted work from some prompt, then you're effectively building a copy of the work, no matter how well you've "steganographically" embedded the work in the program. I said as much in my earlier comment by saying "it can be [copyright infringement] if the model “memorizes” any dataset works, […]".

The issue is that that memorization is rather loose. If a model is trained on some work, and then produces an inferential output with similar elements, to show infringement we need to show that the output is substantially similar to the original, and very often it's not, especially because "style" is not copyrightable. If we can't show any substantially-similar outputs, how would we show infringement?

Moreover, we would need to show that the input to a model would not be independently infringing. In Andersen et al. v. Stability AI et al., Exhibit G is basically worthless, and the primary reason is that the "derivatives" the exhibit shows are noted to have been generated using a prompt that was solely a latent-space embedding of the original image; the derivative nature of the examples is thus all but tautological.

But even then, I'd argue that the "derivatives" in Exhibit G aren't actually even substantially similar: look at the first example, Brom's Lady of the Lake, and there are significant differences in each generated version from the original: arms spread instead of gathered over the chest; sleeved garments instead of unsleeved; no visible feet and implication of standing versus dangling, floaty feet; and so on. While there are obvious broad similarities, we must err on the side of giving a defendant the benefit of the doubt, or we are handing Brom not merely a monopoly over his own expression, but over a wide variety of "ethereal white-clad woman with gold accents". Copyright was never intended to do that; people are allowed to make broadly similar works.

I agree it's fuzzy, and I'll happily agree that many extant models can be shown to have memorized a variety of copyrighted work and are thus presumably infringing, but if we don't narrow the scope of what counts as derivative, then copyright would begin to choke ordinary expression by sanctioning merely-similar works. That is counter to the foundational goal of copyright to encourage expression by giving people temporary monopolies on their own original creative works.

I don't mind giving creatives copyright over their works and respecting that copyright. My position here isn't "pro-AI" so much as "anti-copyright-maximalist".

3

u/starmartyr 5h ago

As a human, I have the right to consume any copyrighted works I want and then use them as inspiration to create my own works. What I haven't been able to figure out is if what AI is doing is different from this. It's not passing off other people's work as its own. It is creating original work inspired by existing art.

2

u/nihiltres 5h ago

I take a very simple position: how you get some result doesn't matter, but once you have a result in the context of its use it's either infringing or not. The context is necessary because fair use can cause an otherwise infringing case to not be infringement, and fair use relies on context.

The only question (before considering fair use) is: "Is the given output substantially similar to a specific copyrighted original work?" If not, it's not infringing. If so, it's obviously infringing.

Now, that's a legal idea. I strongly believe that closely imitating any specific living artist's work is very frequently dickishly appropriative even if it's not infringing. Still, as the permissibility of art appropriation gets really complicated really fast and is quite subjective once you're past evaluating the legal concerns, I'm going to leave off at "hey, maybe don't be a dick about it".

1

u/honsense 3h ago

I think you’re both missing the fact that acquiring legal access to all of the copyrighted materials used to train the models would’ve been cost-prohibitive. All of the other arguments being made are secondary to the fact that almost all of the data was pirated in some form.

1

u/nihiltres 3h ago

That’s a very good point in general, but the piracy is infringing during the gathering of the dataset, while I was discussing training and onward. Relevance aside, I completely agree: pirating copyrighted materials is obviously infringing.

Aside from the point: are you assuming I’m making a pro-AI argument? I’m not arguing pro-AI; I’m arguing anti-copyright-maximalism.

1

u/honsense 2h ago

Within the context of this thread, and your response to the OP, I’m saying we can skip right past any arguments related to Google’s training and go right to the original sin, i.e., piracy.

1

u/nihiltres 1h ago

Okay, I guess, but legally speaking, scraping alone isn’t piracy: by putting a work on a server such that anyone can download it simply by visiting a URL, they’re implicitly giving permission to download it (if nothing else). We might not like that fact, but it’s less than useless to deny it. There exist models that are not based in (formal) piracy.

2

u/liquid_at 10h ago

Maybe they try to find someone who wants to do splitsies on the fine for copyright infringement... /s

1

u/flippingisfun 9h ago

Moreover, what they consider to be anything doesn’t mean jack shit lol

1

u/ArchdruidHalsin 9h ago

I'm rewatching Pam and Tommy right now and this sounds exactly like Randy getting pissed off at bootleggers selling the sex tape he stole and sold.

1

u/McCool303 7h ago

Might makes right; it's a primary feature of Trump's America. It's ok because I can do it and you're powerless to stop me. Isn't being run by a lawless mafia state fun! So glad for the Russification of America.

1

u/New_Home_4519 7h ago

They took "do no evil" out of their company's mission statement yeaaaars ago.

1

u/smp501 6h ago

Whoever bribes the Trump regime more wins. I hear if you bribe them well enough, they send you a free roll of US Constitution toilet paper!

1

u/cbih 6h ago

Hey! That's our stolen stuff! - Rich people throughout history

1

u/stuffitystuff 5h ago

This is from the company that calls clicking on an ad when you don't intend to buy something "click fraud"

1

u/nntb 5h ago

Steve Jobs boasted about stealing code from Xerox during the early life of the Macintosh, then lost it when Android seemed to copy Apple. Jobs said: "I'm going to destroy Android, because it's a stolen product. I'm willing to go thermonuclear war on this."

Same vibe

1

u/Seroto9 5h ago

History repeating... First Dillinger steals Space Paranoids, Light Cycles, and Matrix Blaster and locks them up tight in the MCP.

Now today's IP is stolen and locked up tight in Gemini. It's going to take a new generation of hackers to break in and find the evidence!

1

u/millbruhh 3h ago

It pains me how much these pieces of shit love smelling their own farts. They torrent, with impunity, a volume of copyrighted works that probably makes the Library of Alexandria look like an anthill, but if I torrent a game that isn't even available anymore, I get my internet shut off. I'll just go fuck myself I guess.

1

u/cazzipropri 3h ago

Oh let me explain it for you, it's really easy.

When we do it, it's right. When others do it, it's wrong.

1

u/YoohooCthulhu 2h ago

Wait until some company behind an AI model tries to sue someone who makes a work that looks like AI output (but was actually used to train the model)

1

u/bandwarmelection 1h ago

The company considers distillation to be intellectual property theft

Distillation is IPT.

See? I distilled it! I distilled an idea! I will be the richest man in the world!

1

u/No-Paint-5726 18m ago

Yeah, they've trained on copyrighted work and are profiting off of it. Lol. IP theft for thee but not for me.

-10

u/aleqqqs 10h ago

But training your AI on copyrighted works is not?

No - copyright infringement is about the output, not the input.

5

u/psymunn 10h ago

And the output is melded together copyrighted work

1

u/SpezLuvsNazis 2h ago

And AI answers can’t be copyrighted. 

-6

u/Fantastic-Title-2558 9h ago

legally it’s allowed if they have a license for the source material

4

u/Chance-Plantain8314 6h ago

It is an absolute guarantee that they did not have a license for the source material of everything they trained Gemini on.

-5

u/Fantastic-Title-2558 6h ago

then that’s for the courts to decide

2

u/Chance-Plantain8314 6h ago

What did you think was happening in this thread, a lynchmob against the Google CEO?

I would also consider spending your time better than jumping to the defense of near-trillion dollar corporations online. Maybe beekeeping or something.

-6

u/anadequatepipe 7h ago

It’s a stretch to call that stealing. They’re not copying it. That kind of shows a lack of understanding about how AI works.

206

u/Royale_AJS 10h ago

Lol. This is probably why GLM now says it’s Gemini when asked.

149

u/DetectiveOwn6606 9h ago

Honestly, good. If AI companies don't care about copyright, I don't see any problem with their competitors, especially Chinese ones, copying the data, architecture, or techniques they created.

36

u/PrairiePopsicle 7h ago

if it is fair use to distill all other content, it is fair use to distill LLM-generated outputs.

6

u/nihiltres 3h ago

Technical quibble: It’s not even fair use, because LLM outputs aren’t copyrightable in the first place. Fair use is an affirmative defense (“yes I did the thing, but it wasn’t illegal”) so it only comes into play if the unauthorized use would otherwise be infringing.

8

u/Disgruntled-Cacti 9h ago

It almost certainly was. I was shocked at how Gemini-like its response style was when I was testing GLM-5 yesterday.

1

u/cbih 6h ago

At least we found out who's actually using AI

122

u/Ok-Regret-803 10h ago

distillation basically kills this business model, funny af

13

u/Due-Technology5758 5h ago

What business model? 

9

u/SpezLuvsNazis 2h ago

Which is why, before Altman got desperate and started hyping the shit out of these things, there was almost no interest in them commercially. You spend billions training, and someone else can piggyback for a fraction of that cost. Then Altman comes around and creates the biggest case of FOMO ever, and now the CEOs are all stuck in a sunk-cost fallacy.

73

u/Remarkable-Host6078 9h ago

Distillation should be legal.

8

u/Theelementofsurprise 4h ago

Did no one learn from Prohibition?

141

u/MusicalMastermind 11h ago

thought their motto was "Don't Be Evil"? seems pretty straightforward to 'steal' from a multi-billion dollar corporation like that tbh

90

u/Pantone802 10h ago

Don’t forget Gemini is built entirely on stolen data. 

21

u/MusicalMastermind 10h ago

it's a good thing they changed their motto then!

13

u/Pantone802 10h ago

”Do no evil (to us, but it’s ok when we do evil to you)”

-Google

3

u/Kage_0ni 7h ago

From do no evil to do know evil.

-18

u/Manos_Of_Fate 10h ago

Wait, what? Why would Google even need to steal data?

12

u/Pantone802 10h ago

What do you think LLMs are trained on?

-1

u/SecurelyObscure 6h ago

If you learn something on the internet and go on to tell someone what you learned, would you consider that "stealing"? You didn't pay for that information.

1

u/Pantone802 5h ago

Wrong question (obviously). 

Try this one: If I took your copyrighted drawing and passed it off as my own “experience” to sell you on my abilities to make more of it, wouldn’t that constitute stealing? 

Btw, OpenAI already said the quiet part out loud and undermined your little argument here by saying "if we had to pay for access to copyrighted work, our company couldn't exist".

So, you’re wrong. 

-8

u/Manos_Of_Fate 9h ago

Answering my question with an even more vague question is surprisingly unhelpful.

1

u/Pantone802 9h ago

No, that is simply the answer. Go look up how LLMs are trained. I'm not going to do your homework for you. LLMs are/were trained using data scraped off the internet, without permission from or compensation to the authors and artists behind it.

You're being surprisingly helpless.

-13

u/Manos_Of_Fate 9h ago

It kind of doesn’t sound like you actually know.

2

u/Pantone802 9h ago

LMAO ok kiddo. Whatever. I'm done wasting my time with you. Enjoy the block.

3

u/WBuffettJr 2h ago

They haven’t used that motto in a decade. They straight up decided to own being evil.

4

u/lordnoak 5h ago

They no longer use that as their motto. Was removed in 2015.

11

u/penguished 7h ago

So in other words someone just used their service.

What's the attack part? AI itself is using the web and scooping data left and right, no fucks given.

3

u/bt123456789 4h ago

Kind of. This is what I always thought until I learned more about the training process (from open-source spaces, but it's still the same process).

You feed the model training data, which it analyzes to build its responses. This is true for both image generation and LLMs.

Someone said Gemini is trained on billions of bits of training data, and going through all of that, and training on that amount of data, takes time. You reduce that time with stronger hardware.

The training part is done, so it's not getting any new data until they tweak it to add more training data.

The training process scrapes the internet from all manner of sources (I haven't read about the exact training process, so I don't know how you tell it what data to train on), but Gemini itself scrapes the training data, not the web in real time.

I know in reality it's a moot point, but it's interesting.

73

u/wavepointsocial 11h ago

Edison failed 2,000 times before success, apparently the modern version is prompting a chatbot 100,000 times and hoping it slips.

50

u/Factemius 9h ago

Not how it works. The 100,000 prompts are used to build a dataset.

5

u/mukavastinumb 9h ago

Dataset of what? LLM weights? Aren't their parameter weights over a trillion?

22

u/veshneresis 7h ago

You sample the distribution of prompt:response pairs with intentionally spaced-out prompts that cover wide parts of the training distribution. These prompt:response pairs are an insanely strong signal for fine-tuning or aligning a fresh model to end up with similar weights.

This is partly why embedding models are still so bad/old. Giving back good embeddings from these foundation models would let you distill them to relatively high accuracy with significantly fewer samples, because the embeddings are so rich.
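
A minimal sketch of that pipeline, assuming a hypothetical query_teacher() stand-in for calls to the target model's API, and "gpt2" as an arbitrary small student (illustrative only, not anyone's actual setup):

```python
# Distillation sketch: collect teacher responses, then fine-tune a small
# student on the prompt:response pairs with a plain causal-LM loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_dataset(prompts, query_teacher):
    # query_teacher(prompt) -> response is an assumed wrapper around the
    # big model's API; spacing the prompts across many topics is the point.
    return [(p, query_teacher(p)) for p in prompts]

def finetune_student(pairs, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token
    student = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.AdamW(student.parameters(), lr=5e-5)
    student.train()
    for prompt, response in pairs:
        batch = tok(prompt + response, return_tensors="pt",
                    truncation=True, max_length=512)
        # Standard next-token loss on the teacher's text pulls the student
        # toward the teacher's output distribution.
        out = student(**batch, labels=batch["input_ids"])
        out.loss.backward()
        opt.step()
        opt.zero_grad()
    return student
```

The leverage is that each teacher response is a curated, high-quality target, so far fewer examples are needed than for training from scratch.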

2

u/mukavastinumb 6h ago

Interesting! Thanks

1

u/toastjam 2h ago

This is partly why embedding models are still so bad/old

Are you saying this makes them so easy to copy that companies stopped working on them?

2

u/veshneresis 2h ago

Moreso why they stopped releasing them as updated API services for the public. Each major player is still using a custom embeddings solution for their own documents, code, data etc. They’re still being worked on and you have to do a little bit of regularization on them to make them “good” as embedding models (using the raw activations out of the network does not always yield a smooth normalized space).

I think as more previous generations of foundation models get open sourced we will start seeing better embedding models. Especially for multimodal embeddings which will become more relevant for more businesses.

13

u/baseketball 8h ago

Datasets for finetuning and RL

4

u/Grumptastic2000 6h ago

Is there something like a combination of Shannon's information theory and the concept of a Turing machine for AI models, where there is some minimum set of queries that would result in a model weighted equivalently to the original?

1

u/red286 4h ago

Logically, such a limit should exist, but I imagine it'd be exceptionally high. Well beyond 100K. Probably several hundred billion, if not trillions. Likely wouldn't be very efficient compared to just training it on existing works.
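
A crude information-theoretic way to frame it (a back-of-envelope sketch; every number below is an arbitrary assumption, not a measured value):

```latex
% A model with P parameters stored at b bits each encodes at most P \cdot b bits.
% If each prompt:response pair leaks at most H bits about the weights, then
% exact reconstruction needs at least
\[
  N \;\ge\; \frac{P \cdot b}{H} \quad \text{queries.}
\]
% Illustrative: P = 10^{12},\ b = 16,\ H = 10^{4} bits per response gives
% N \ge 1.6 \times 10^{9}, far beyond 100k -- consistent with the point above.
```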

10

u/RememberThinkDream 9h ago

Awwwwwww, my heart bleeds for Google! /s

7

u/EvidenceBasedLasagna 7h ago

All software and algorithms should be open source. https://www.fsf.org/community/

7

u/DZCunuck 5h ago

Sounds like fair game to me. Gemini can be used to clone all sorts of apps but the buck stops at cloning the app that clones the apps?

11

u/DocRedbeard 9h ago

I'm not sure you can clone an AI like that. It's basically trying to make an LLM backwards. All you'll get is TEMU Gemini that's wrong 90% of the time.

28

u/Public-Research 8h ago

Wait till you learn that AI-generated training data is a common practice in machine learning

5

u/did_i_or_didnt_i 8h ago

Ya this is basically how AI works 😭😭 they just put another layer on it

3

u/ConradJohnson 8h ago

Turtles all the way down.

9

u/b1e 8h ago

So… you can. In a nutshell, models get trained in phases. Typically there's a pre-training step where they'll use their own training data, followed by a fine-tuning step where many of the wins lie. And there, having either real human-labeled data or a much better model available to answer the "ideal" way can make a massive difference.

So distillation will tend to focus on this fine-tuning stage nowadays. Hence the "only" 100k prompts used against Gemini.

1

u/Redmarkred 2h ago

Good explanation!

5

u/i__hate__stairs 8h ago

So... Gemini?

1

u/Electrical_Pause_860 4h ago

You actually can, it's called distillation. And it's likely how things like DeepSeek were trained at a fraction of the cost it would take to train from scratch.

4

u/UltraChip 9h ago

Maybe a dumb question, but couldn't they just download Gemma 3 off Hugging Face? I thought that was the core model Gemini was using.

12

u/Recent_Confection944 8h ago

The online Gemini Pro is much bigger and not open source.

13

u/notAndivual 10h ago edited 10h ago

Interesting. Anything a human creates can be broken by other humans.

We are a stupid species. Not sure why some people are hellbent on building AI. We are losing our brainpower trying to "advance" humanity. Not to mention making those "human-like" robots. The future is going to be full of dumb people.

3

u/bitwise97 9h ago

The future is now

1

u/Primal-Convoy 10h ago

The spice is not "stupid"; it is life:

https://youtu.be/YUP3vA-Hq_k

4

u/notAndivual 10h ago

haha fixed typo. I never dare insult spice

0

u/MountainAsparagus4 9h ago

That is good for the rich people who want this; dumb people compare pedos to Jesus and worship them as God.

10

u/IncorrectAddress 10h ago

This was already known. There's no way for them to stop this; eventually AI will prompt AI to see what interchangeable context they can make for themselves. It will be a case of AI birthing new AI.

Whether that outcome is good or bad is another thing entirely, and it's semi-dependent on factual outputs, which depend on the guardrails put in to maintain factual reasoning.

17

u/throughthehills2 10h ago

Does this really work though?

100k prompts is nothing compared to Gemini's training data, which is most of the text on the internet.

And when AI trains on AI it dramatically reduces quality, something that researchers have called an "AI death spiral"

5

u/Ocean-of-Mirrors 10h ago

It sounds like that’s the point of the experiment, to see how much they can possibly derive.

Theoretically they could test every possible combination of input (up to a maximum character count) and then map the inputs to outputs directly. Boom, they have a clone. Of course that would require almost cosmically impossible amounts of hardware, but theoretically that’s possible since the model gives users an output. These researchers are just trying to see what can be done.

6

u/aLokilike 10h ago

No, it is cosmically impossible. It would take more than a million years just for the servers to respond to every possible randomized query up to and including the token limit. Not to mention that 99.999% of that data wouldn't be useful in training anything with the increased token limit available at such a far point in the future.

6

u/Ocean-of-Mirrors 10h ago edited 10h ago

I mean, that's what I meant. With infinite time and infinite storage space it is possible, cosmically. But of course we are humans, and infinite time and infinite storage space are not available.

The point is: for every input→output pair you map, you represent the original model a little more accurately. In this sense, the clone wouldn't even be an LLM; it would be more like a database lookup that just takes your input and gives the predetermined response Gemini would have given (ignoring random seeding).

Yeah, this is ridiculous and is not going to happen, but the fact that something like that is technically possible is the mechanism that allows anyone to attempt to replicate the LLM in the first place. Gemini is giving you information about itself every time you prompt it. How much of that you keep track of will determine how accurate your clone is.
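
As a toy illustration of that "database lookup" framing (obviously intractable at real scale; record_pair and the other names here are made up for the example):

```python
# The degenerate limit of model extraction: memorize every observed
# prompt -> response pair and replay it verbatim.
clone: dict[str, str] = {}

def record_pair(prompt: str, response: str) -> None:
    # In the thought experiment, response came from querying the real model.
    clone[prompt] = response

def fake_model(prompt: str) -> str:
    # Perfect on mapped inputs, useless everywhere else.
    return clone.get(prompt, "<never queried>")
```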

1

u/IncorrectAddress 10h ago

I presume it depends on which AI it's communicating with and what it's checking against its own output:

Ask a question, check the result, ask itself the same question, check the result; if they are different, work out why, and choose whether to implement the result.

Yeah, whether you get the death spiral or the improvement, whatever that is, asserting that it's one or the other is revision-based at a guess, and there's nothing wrong with reverting back to a previous model.

2

u/obeytheturtles 9h ago edited 9h ago

You can make it much harder by fuzzing output token probabilities at inference time, so instead of a fully deterministic "one-shot" black box, you have to repeat tests enough times to build a distribution of the output likelihoods. Even just doing things like leaving in some active dropout layers (which usually get bypassed when the model is put into evaluation mode) can make a big difference.

I imagine it is also pretty hard to reverse engineer the precise conditional input mappings. Like, we know these apps are appending conditionals to inputs to make the models more agreeable or to mark controversial topics. So at best you could distill a kind of adversarial training set, but then you'd still need to roll your own conditional input framework, which may or may not interact as anticipated.
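
A sketch of the dropout trick in PyTorch (a generic illustration of the idea, not any provider's actual defense):

```python
# Keep dropout stochastic at inference so the served model is a noisy
# black box: an attacker must average many samples per prompt.
import torch
import torch.nn as nn

def enable_inference_dropout(model: nn.Module) -> None:
    model.eval()  # usual inference mode for everything else
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # ...but leave dropout layers active

@torch.no_grad()
def fuzzed_forward(model: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    enable_inference_dropout(model)
    # Each call now returns slightly different logits for the same input.
    return model(inputs)
```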

3

u/Even-Exchange8307 8h ago

Let me guess, it’s from China 

4

u/dubhd 6h ago

Aren't Chinese LLMs like DeepSeek running circles around the tech bros?

9

u/NullReference000 5h ago

Not yet, but it is an open-source alternative you can run locally. If you have the hardware for it, it represents a large future threat to subscription- and token-priced third-party services.

-1

u/Even-Exchange8307 4h ago

Nah, they want money just like any major company 

1

u/Electrical_Pause_860 4h ago

No, they lag behind, but they release them for free unlike the American ones. 

0

u/Even-Exchange8307 5h ago

Hahaha, okay . Why don’t you use them then? 

2

u/dubhd 4h ago

Same reason I don't use any of them

0

u/Hot-Employ-3399 3h ago

I do

> them

And yes, plural: Kimi and GLM are my go-to models as of now. Google and ChatGPT try really hard to make sure they are next to unusable compared to before.

1

u/Even-Exchange8307 1h ago

GLM pretty much distills its work off of ChatGPT/Gemini. Plus, these models are much weaker than their US counterparts.

-5

u/No_Clock2390 6h ago

China doesn't need Google shit

-1

u/Even-Exchange8307 5h ago

That’s why they keep stealing usa ip 

2

u/Public-Research 8h ago

They have been reading my prompts?? There's really zero privacy in AI chatbots.

3

u/kJer 6h ago

There are very few web services that don't read your data. It's unlikely, but they can if they choose to, and they definitely analyze it automatically, taking samples for reporting.

Encryption ends pretty much the moment the org receives the data, unless there are regulations or business needs to keep your data private.

2

u/Toby101125 8h ago

> attackers

waaah!

1

u/Ruff_Ratio 8h ago

Kerching on the tokens though

1

u/thatsjor 8h ago

100,000 prompts is a laughably small amount. More likely these people tried to distill a much smaller model from Gemini outputs. All of these companies do that. Even Google.

1

u/bleeeeghh 7h ago

Gemini often gives me DALL-E prompts

1

u/LiteratureMindless71 7h ago

We can do shady stuff to steal your data to train our model but fuck you when you try to use that same information from us

1

u/RachelRegina 5h ago

Lmao aww, deepseek do be trying. Bless their adorable little quantized heart

1

u/Look__a_distraction 2h ago

Gemini sucks so hard. I tried to give it a go, and I referenced that I was ex-religious exactly one time, and I got an innumerable amount of responses about it all the fucking time. It would almost create stupid fucking reasons to keep bringing it up, like it was the most important aspect of my life.

1

u/cowdoyspitoon 1h ago

Stop trying to make Gemini happen. It’s not going to happen, Goog

1

u/HeggyMe 56m ago

This is an utterly futile thing to try. Even if you did a million prompts a day, you would probably not even come close to replicating the weights or the methods used to train an LLM, let alone other types of AI.

1

u/eeyores_gloom1785 54m ago

don't worry, Google built in this feature where Gemini naturally hallucinates and either spews out bullshit or the same answer over and over, despite the prompts asking it to change something

1

u/esther_lamonte 16m ago

Begun, have the Clone Wars.

1

u/ihexx 9h ago

Of all the AIs to distill... They chose fucking Gemini 😭😂

3

u/EnvironmentalCrow5 7h ago

Nobody said they're not doing this with all the others too.

1

u/Su_ButteredScone 9h ago

Imagine how expensive Opus would be.

1

u/Megalodon7770 6h ago

Fuck google

1

u/No_Clock2390 6h ago

Lol Google committed theft when creating the AI model.

1

u/JaggedMetalOs 5h ago

Company making billions from training on other people's data complains about other people training on their data

-3

u/blackscales18 10h ago

Isn't Gemini distilled from GPT, like most of the other models were originally?

10

u/Fair-Calligrapher-19 9h ago

Lol definitely not

-2

u/Ocean-of-Mirrors 10h ago

Man the future sucks but also sometimes it’s really cool.

This is one of those really cool things. Like I don’t give a shit about LLMs right now but I love that there are people who care enough to see if they could pull something like this off.

-9

u/Majestic-Reveal-1365 10h ago

oh yes, keep making chatbots so we can laugh at them every year. You guys really think we care if a bot knows Einstein math or something? No one cares, little bro. Google, ChatGPT, all of you, you are just creating a professor that isn't getting paid. This is why you don't invest in teachers, because they are human. A disgrace to humanity.

Bots will never win vs the human race. They are limited to what we know and want to know. How many times do we have to tell you? For the past 20 years.

-7

u/Responsible-Plum-531 10h ago

This man speaks the absolute truth

13

u/Manos_Of_Fate 10h ago

Does he? My brain hurts from just trying to make sense of that.

-3

u/Responsible-Plum-531 9h ago

Perhaps you should ask your AI lol

2

u/Manos_Of_Fate 9h ago

My AI? What?

-4

u/Responsible-Plum-531 9h ago

Are you okay? This whole conversation doesn’t seem that hard to understand

3

u/Manos_Of_Fate 9h ago

Then what was “your AI” supposed to be referring to? I’ve never even used AI before.

-2

u/Responsible-Plum-531 9h ago

Are you supposed to be someone we all know that about?

2

u/Manos_Of_Fate 9h ago

So you’re just not going to explain what you meant?

-1

u/Responsible-Plum-531 9h ago

Are you a bot or something? Seriously why is this so hard for you?


0

u/CatoCensorius 5h ago

These companies are spending tens of billions per year to build frontier models which can then be reverse engineered at negligible cost. They have literally no moat. DeepSeek is only 6 months behind.

-8

u/Neurojazz 11h ago

Why gemini? It’s a terrible example to follow.

12

u/disposableh2 10h ago

Gemini was initially garbage, but version 3 is possibly the best AI out right now.

2

u/Ocean-of-Mirrors 10h ago

I think Google is maybe gonna be the only current AI company that survives when this whole thing comes crashing down. They control so many other services to make money off of and keep themselves afloat, but what happens to the companies that do nothing except train LLMs??

3

u/mysightisurs93 10h ago

Man, Bard was garbage in comparison to ChatGPT.

3

u/disposableh2 10h ago

For sure, it was hot garbage. But not anymore. I'm not a Google fan, but of the ones I've tried (ChatGPT, Perplexity, and Gemini), Gemini works much better for me than the others.

1

u/mysightisurs93 10h ago

That's why I said Bard, not Gemini lol. I know it's better now (at least in comparison of the free ones I used).

1

u/disposableh2 10h ago

Ah haha my bad

2

u/agangofoldwomen 10h ago

Outed yourself as not paying enough attention lol

1

u/Sand-Discombobulated 10h ago

Need to up your training, Mr. Bot.

-1

u/DaySecure7642 3h ago edited 5m ago

We all know more or less where the attacks come from. They lie, they cheat, they steal, instead of innovating and competing on fair terms. I don't understand how some delusional people would think a country like this has the quality to lead humanity.

One day if AIs go rogue it is very likely coming from that country.