On the topic of licenses and LLMs: of course, we have to applaud SourceHut for at least trying not to allow all their code to be ingested by some mechanical license-violation service. But it seems like a hard game. Ultimately the job of their site is to serve code, so they can only be so restrictive.
I wonder if anyone has tried going in the opposite direction? Something like adding to their license: “by training a machine learning model on this source code, or including data crawled from this site, you agree that your model is free for all to use, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese.) I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.
If you squint at the GPL, you could argue that every LLM is already under it, because it's a viral license and there's almost certainly some GPL code in there somewhere. I'm sure the AI companies would beg to differ, though: they want a one-way street where there are zero restrictions on IP going into models, but they can dictate whatever restrictions they like on the resulting model, derived models, and model output.
I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.
I think the courts were pretty clear: prove damages. I'm not saying I agree in any capacity, but the AI companies went to court, and it appears they've already won.
The hope is that we can put the onus on them to start suing people or whatever, at least. The US legal system is biased toward whoever has the biggest budget, of course, but the defense still gets a little bit of an advantage as well.
I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.
There have been copyright office rulings saying that ML model output is not copyrightable, so that last part of the suggested license seems a bit strange, since the rulings could preclude it for code at some point.
Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.
How can AI code be added to any kind of open source license, or would it just be code that isn't covered under the license (since it's effectively public domain)?
In most cases it would just be added like all other code as if it was proper/allowed until someone brought a civil suit with enough evidence to claim they had no permission to use the code AND it somehow damaged them, which could be quite difficult and prohibitively expensive.
So I would argue in most cases people will get away with it. We must remember that the only person's opinion that matters on what's actually illegal or not is a judge's.
> There have been copyright office rulings saying that ML model output is not copyrightable
Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
If someone had instead argued that they themselves were the author by means of instructing an LLM to generate the output based on a prompt they made up, plus the influence of prior art (the pretrained data), which you could also argue is not only what he probably did anyway (but argued it poorly), but in a way is also how humans make art themselves... that would be a much more interesting decision IMO.
Blocking aggressive crawlers - whether or not they have anything to do with AI - makes complete sense to me. There are growing numbers of badly implemented crawlers out there that can rack up thousands of dollars in bandwidth expenses for sites like SourceHut.
Framing that as "you cannot have our users' data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.
I agree; blocking aggressive crawlers that are badly behaved, etc., makes sense. The files that are public are public, and I should expect anyone who wants a copy can get them and do what they want with them.
I still maintain that since we already have this system (it's called "looking up your ISP and emailing them") where if you send spam emails, we contact your ISP, and you get kicked off the internet...
And the same system will also get you banned from your ISP if you port scan the Department of Defense...
why are we not doing the same thing against DoS attackers? Why are ISPs willing to cut people off over spam mail, but not over DoS?
From the Anubis docs:
> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.
This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.
My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".
Just run the thing single threaded if you have to.
>My browser is supported by your website if it implements all the things your website needs.
Well, I guess your browser does not support everything needed? Being able to run a multi-threaded proof of work is not the same as checking arbitrary user agents; any browser can implement it.
By default, Anubis will not block Lynx and some other browsers that do not implement JavaScript, but it will block scrapers that claim to be Mozilla-based browsers; many of the badly behaving ones do claim to be Mozilla-based, so this helps. (I do not have a browser compatible with the Anubis software, and Anubis does not bother me.)
If necessary, it would also be possible to do what powxy does, which is display an explanation of how the proof of work works, in case you want to implement your own.
Having alternative access to some files via other protocols might also help.
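As an aside for anyone who does want to implement their own check: the general shape of a hash-based proof of work is simple to sketch. This is a generic hash-puzzle illustration, not Anubis's actual algorithm; the function names and the difficulty encoding are mine:

```python
import hashlib

def solve_pow(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce so that sha256(challenge + nonce)
    starts with `difficulty` zero hex digits. Work grows ~16x per digit."""
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so verification is cheap even though
    solving was expensive for the client."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the whole point: solving costs the client many hashes, verifying costs the server one, so hammering a site millions of times stops being free.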
We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.
In fact in light of the recent Github discussion I'd say I personally see this as a reason to avoid sourcehut. Sorry, but I want all the visibility I can get.
I understand their standpoint: it's their infrastructure, and their bills.
However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.
Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.
> What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
I expect them (especially if they charge for it) to work in my interests as much as possible. Sure, defending themselves against abuse is fine. They have to survive to keep providing service.
However I don't appreciate the imposition of their own philosophy on something that isn't theirs. This here:
# Disallowed:
# [...]
# - Anything used to feed a machine learning model
What do you suggest they do? Or is it just the political position that's the problem? The result is the same: pretty much every single AI company is abusing SourceHut.
They have to do something, because I pay for a service, and if I can't use it, I'm not paying in the future. If that means blocking the AI companies that's fine, they can contact me if they want to use my code, we'll figure something out.
I expect hosts to be neutral to the maximum possible extent.
For example, I expect a host not to have an arbitrary beef with Bing or Kagi, or to refuse to allow connections from France. Blocking can of course occasionally be necessary, but what I want from a host is a blocking policy that is as minimal and selective as possible.
Yes, I understand it's a lot of work and is quite inconvenient, but especially if I'm paying for a service, I'm interested in my interests, not in what's convenient for the host.
I don't believe you are paying for Sourcehut hosting, so why do you care?
For that matter, "This has been part of our terms of service since they were originally written in 2018" so even if you are paying for hosting, why did you start using their services in the first place?
I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
> I don't believe you are paying for Sourcehut hosting, so why do you care?
I theoretically could, and I imagine it's posted here to discuss the linked post. So I am.
> For that matter, "This has been part of our terms of service since they were originally written in 2018"
The "No AI" bit seems to show up only in late 2024. Which I'd regard as an extremely unwelcome development had I been paying.
> I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
Likewise. In my case, my belief is that when you pay somebody, it's to get things done your way. So, for instance, I'd be a lot more pleased with a setting.
I don't think the use cases you're describing are what any critics are talking about.
How do you feel about someone with more funding than you going to an LLM and saying, "Reimplement the entire Overte source for me, but change it superficially so that Overte has a hard time suing me for stealing their IP?"
I see; I encountered something similar with a DSL. For my use case, I had better results by having an LLM scrape a well-formed doc reference page than a source code repo. I'd assume that same behavior extends to training data.
Oh, I'm sure there's all sorts of practical considerations regarding optimal LLM training.
All the same though, I don't like my host being so opinionated. I don't want a host that has something against any of the common search engines, and I don't want a host that has something against LLMs. Hosts should be as neutral as possible.
The code might not be theirs but the service hosting the code is and nothing is stopping you from hosting your code elsewhere. For some people blocking LLMs might be a reason to use sourcehut over github.
This would be fine in an ideal world. However, the one we live in has crawlers that don't care how many resources they use. They're fine with taking the server down or bankrupting the owner as long as they get the data they want.
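A middle ground between serving everyone and blanket blocking is per-client rate limiting: a polite clone sails through, while a hammering crawler gets throttled. A minimal token-bucket sketch (the rate and capacity numbers are tuning choices for illustration, not anything SourceHut actually uses):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client IP (or per network block) and answer 429 whenever `allow()` returns False.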
>We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
That's not what the Apache license says.
According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.
I pay for sourcehut hosting, and I have no problems at all with this decision.
Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.
We're effectively open source VR chat, if you want to get an idea of what it's like.
Crypto-focused projects of this kind include Decentraland. Which tend to devolve into things like selling virtual land. That's not our jam and doesn't align with the way the project works anyway -- you can have all the land you want, and set things up for free on your own server.
I don't have any more issues with ChatGPT than I have with Google and Kagi. All of those are closed source projects. But by all means, I love open source, so if an open source LLM can do something like writing code for our platform, that'd be wonderful.
As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).
The solution is to make Git fully self-contained and encrypted, just like Fossil[1] - store issues and PRs inside the repository itself, truly distributed system.
Not sure exactly what it is referring to, but I could make a guess that it's because Cloudflare sells LLM inference as a service, but also a service that blocks LLMs. A bit like an anti-DDoS company also selling DDoS services.
> A bit like an anti-DDoS company also selling DDoS services.
That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.
Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:
mrstresser.com. 21600 IN NS sterling.ns.cloudflare.com.
silentstress.cc. 21472 IN NS ernest.ns.cloudflare.com.
maxstresser.com. 21600 IN NS edna.ns.cloudflare.com.
darkvr.su. 21600 IN NS paige.ns.cloudflare.com.
stresser.sh. 21600 IN NS luke.ns.cloudflare.com.
stresserhub.org. 21600 IN NS fay.ns.cloudflare.com.
"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."
I've not spent much time on this topic, but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed Cloudflare behavior, and that third party was sued?
I remember that when the first influx of LLM crawlers hit Sourcehut, they had some talks with Cloudflare, which ended when CF demanded an outrageous amount of money from a company the size of Sourcehut. If I find the source for this, I'll update.
Cloudflare has been accused of playing both sides: they host services for known or associated DDoS providers while conveniently offering services to protect against DDoS.
How about using this to contribute an absolutely tiny amount of hashes to a mining pool on behalf of the website owner, instead of just burning the energy?
What I don't understand is why these scrapers so aggressively scrape websites which barely change. How much value is OpenAI etc getting from hammering the ever-living shit out of my website thousands of times a day when the content changes weekly at most? I truly don't understand the tactic. Surely their resources are better spent elsewhere?
The LLM scrapers could publish the ip ranges they use for scraping like google does, but that would make it easier to block them so they probably wouldn't do that.
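For comparison, Google also supports verification that doesn't depend solely on published ranges: reverse DNS on the requesting IP, check the hostname falls under googlebot.com or google.com, then forward-confirm that the name resolves back to the same IP. A rough sketch (function names are mine, and the two lookups obviously need network access):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def name_is_google(host: str) -> bool:
    """Pure string check: is the reverse-DNS name under Google's domains?"""
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_verified_googlebot(ip: str) -> bool:
    """Reverse lookup, suffix check, then forward-confirm the IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False
    if not name_is_google(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

The forward confirmation matters: anyone who controls reverse DNS for their own IPs can make them claim to be `crawl-x.googlebot.com`, but they can't make Google's forward zone answer for that name.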
Use of published information is still always constrained by copyright law. If I had a copyrighted movie playing on my television visible through the window, and you recorded that, redistributing that recording would unambiguously be a violation of copyright law and piracy.
I’m also a little confused by what you’re saying here; are you asking whether scraper bots are illegal, or whether they’re immoral/unethical?
Looking through your window is already covered by a lot of laws (it's legal sometimes in some places if you didn't take reasonable effort to prevent it [like closing the blinds], and as long as there was no trespass). If I captured a small enough section of a video - say, one frame in a photo - that likely is fair use. It is not crystal clear.
I’m getting to the ethical aspect but also trying to be pragmatic. “Publishing a bunch of information on the internet accessible without authentication” is an action that is fundamentally incompatible with controlling the use of that information.
The law cannot substitute for common sense; criminals are gonna crime.
Your drone makes a loud buzzing sound, and blocks the street for anyone else trying to get in, and does not move on its own. This is where it escalates from "Taking advantage of public information" to "Harassment".
Crawling SourceHut once is taking advantage of public information. Crawling it once a day using deltas might still be that. What these AI companies are doing is not that.
I feel like we're part of a dying generation or something. I keep seeing people who want to post content to the public internet, but they still want to own the data somehow, and control who sees it, but still on the public internet.
I'm not sure how it's supposed to work, as I see the public internet as just that, a public square. What goes there can be picked up by anyone, for any purpose. If I want something to be secret, I don't put it on the public internet.
Gonna be interesting to see how that "public but in my control" movement continues to evolve, because it feels like they're climbing an impossible wall.
You have to actually utilize the legal system. Others suggest attaching licenses to their published content/data preventing AI training, but attaching the license alone does nothing. You have to actually drag someone to court once in a while for it to work.
It's the same complaint about the GDPR: if it works, why are sites still doing X/Y/Z... Well, because all people do is complain online; you need to report violations and be prepared to take legal action.
I get what you're saying, and I do question the value of suing someone in Russia or China, but did you actually get a lawyer to file an actual real lawsuit in Russia? Again, Russia: probably not really going to work.
There's absolutely no reason why you couldn't drag OpenAI to court... you'd need a ton of money, but you could and if you win, then the rest of the AI companies are going to get very busy adjusting their behaviour.
In the music biz, copyright is sometimes used to prevent very specific uses of material. Recent examples include a number of musicians (or perhaps their labels or publishing houses) denying the use of certain pieces of music at rallies for particular politicians.
IANAL, but my read of this is that if the content has the appropriate licence, the licence holder can withhold certain rights & access from certain groups of potential licensees. I'm loosely aware that the common open source licences are highly permissive, so probably they can't be used in this way... but presumably not all licences are like that. So, even though the work is "public", it should still be possible to enforce subsets of rights.
And to take your "public square" analogy... before we had cameras everywhere, there was some expectation of "casual" privacy even in public spaces. Not everyone in the square hears everything said by everyone else in the square. The fact that digital tools make privacy breaches much easier doesn't mean it should be tolerated.
(that said, I'm fairly careful what I publish online)
Though that also goes towards those posting content. Systems that generate an infinite number of permutations for viewing the same information are a poor design. It can easily lead to even conscientious people discovering that a simple attempt to slowly mirror a website overnight with wget has resulted in some rabbit hole of forum view parameter explosion.
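One mitigation on the publishing side is to canonicalize away presentation-only query parameters, so a crawler (or a polite wget mirror) sees one URL per page instead of every sort/view permutation. A sketch; the noise-parameter list is hypothetical and would be site-specific:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that change presentation, not content:
NOISE_PARAMS = {"sort", "view", "theme", "sid"}

def canonicalize(url: str) -> str:
    """Drop noise parameters and sort the rest, so equivalent views
    of the same page collapse to a single URL."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in NOISE_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Advertising the result via `rel="canonical"` (or redirecting to it) is one way to hint well-behaved crawlers out of the permutation rabbit hole.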
The public internet is a public square, but does it have to be some caricature 1970's Times Square dominated by dealers, pimps, and thugs all looking to extract whatever they can from me?
I'm one of those people. I think it comes down to the intent. There is an implicit good will of those that do this that the data isn't abused or the infrastructure behind it overwhelmed (self-hosting). "Big Tech" just make this worse, because their motivations aren't the same as ours (small web).
Once all devices are locked down and all control has been taken away from the users, we can finally have functional DRM on every device and make this dream come true.
> I feel like we're part of a dying generation or something
Well... yes. Just like every living thing that's ever existed ;)
But even copyright doesn't give you "control" over something. I can still download it and use it however I want, privately, just that I'm limited legally to further distribute it. Unless I change it sufficiently, then it's again OK to distribute my changed one. Of course, depending on country and whatnot.
Problem remains the same: as soon as the content is visible on the public internet, you lose 100% control of its propagation, illegal or not.
> I can still download it and use it however I want, privately, just that I'm limited legally to further distribute it
the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
it’s just moving the ‘control’ up a level to a different party. it’s still there.
> as soon as the content is visible on the public internet, you lose 100% control of its propagation, illegal or not.
GPL license: thou shall always* provide access to this software’s source code —> form of control in the opposite direction to music copyright, but still a form of control
* subject to terms, conditions, locale and other legal things (IANAL)
> the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
That's why every social media site in existence puts terms in its EULA demanding that users grant the site a blanket license to redistribute their content, over and above any separate licenses they may put on it. After it's been redistributed to third parties, the copyright holder has no more control (at least, not via copyright law) over how those copies are privately used.
E.g., on HN: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, 'User Content'), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
If you post something on a physical bulletin board, you expect people will come by and read it
If a bunch of "scrapers" come by and form a massive mob so that nobody else can read it, and then cover the board with slightly different copies attributed to themselves, that isn't exactly the "public square" you imagined
> If you post something on a physical bulletin board, you expect people will come by and read it
Or birds, or public cameras, or people who take photos next to it, or people archiving bulletin boards, or...
If I put stuff in public spaces, I expect anyone and anything to be able to read, access and store it. Basically how I treat this very comment, and everything else I put on the public internet.
Eh, the internet is a weirdly semi-public space. Think of it more like a mall parking lot than a public common. If you put something up for notice there it will probably be fine. For example a number of grocery stores around me have boards for things just like that. But the moment it becomes a public nuisance for them they will trespass your ass outta there faster than a starved dog would eat a dropped hotdog.
As you say, it's not public as in public road (and even that has police to enforce proper behavior), but more like publicly accessible but on private properties.
When it's my server and I'm paying for it, then banning resource wasters is the right move.
You can’t have your cake and eat it, too. The power of the internet and mass media is that you can publish to the whole world, and make something public to billions of people. With that power comes side effects, such as, obviously, billions of people being able to privately do whatever they want with the information you, you know, published.
“public, but, no, not like that” isn’t a thing and no technological measure can make it a thing.
I don't know, this is like saying that because those bowls of mints at a restaurant are "free", I can back a trailer up to the door and start loading it up. Even if you know they'll never run out of mints.
I feel like you need to present a very strong case where LLMs are "the public" before you take such a weak position when interpreting the entirety of the article.
Drew makes it perfectly clear in TFA that "the public", as he sees it, is fully entitled and should make use of the data SourceHut provides.
If the AI scrapers respected the robots.txt file then this wouldn't be an issue. A company is allowed to set the terms of service for their service and take action if other companies are abusing that.
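Checking behavior against robots.txt is entirely mechanical; Python's standard library even ships a parser. A small sketch (the robots.txt content below is illustrative, not SourceHut's actual file):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt in the spirit of blocking an AI crawler:
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.sr.ht/~user/project"))      # False
print(rp.can_fetch("SomeBrowser", "https://example.sr.ht/~user/project")) # True
```

The point stands either way: parsing and honoring the file is trivial, so when a scraper ignores it, that is a choice, not a technical limitation.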
On the topic of licenses and LLM’s—of course, we have to applaud sourcehut at least trying to not allow all their code to be ingested by some mechanical license violation service. But, it seems like a hard game. Ultimately the job of their site is to serve code, so they can only be so restrictive.
I wonder if anyone has tried going in the opposite direction? Someone like adding to their license: “by training a machine learning algorithm trained on this source code or including data crawled from this site, you agree that your model is free to use by all, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese). I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.
If you squint at the GPL then you could argue that every LLM is already under it, because it's a viral license and there's almost certainly some GPL code in there somewhere. I'm sure the AI companies would beg to differ though, they want a one-way street where there's zero restrictions on IP going into models, but they can dictate whatever restrictions they like on the resulting model, derived models, and model output.
I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.
I think the courts were pretty clear, prove damages. I'm not saying I agree in any capacity, but the AI companies went to court, and it appears they've already won.
The hope is that we can out the onus on them to start suing people or whatever, at least. The US legal system is biased toward whoever has the biggest budget of course, but the defense still gets a little bit of advantage as well.
> Someone like adding to their license
I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.
There have been copyright office rulings saying that ML model output is not copyrightable, so that last part of the suggested license seems a bit strange, since the rulings could preclude it for code at some point.
Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.
How can AI code be added to any kind of open source license, or would it just be that code that isn't covered under the license (since it's effectively public domain?)?
In most cases it would just be added like all other code as if it was proper/allowed until someone brought a civil suit with enough evidence to claim they had no permission to use the code AND it somehow damaged them, which could be quite difficult and prohibitively expensive.
So I would argue in most cases people will get away with it. We must remember that the only person's opinion that matters on what's actually illegal or not is a judge's.
It seems like AI companies are making a major bet on the idea that the output of their models can be licensed and used in products.
> There have been copyright office rulings saying that ML model output is not copyrightable
Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
If someone had instead argued that they themselves were the author by means of instructing an LLM to generate the output based on a prompt they made up, plus the influence of prior art (the pretrained data), which you could also argue is not only what he probably did anyway (but argued it poorly), but in a way is also how humans make art themselves... that would be a much more interesting decision IMO.
What does copyright have to do with it? It's about distribution.
Blocking aggressive crawlers - whether or not they have anything to do with AI - makes complete sense to me. There are growing numbers of badly implemented crawlers out there that can rack up thousands of dollars in bandwidth expenses for sites like SourceHut.
Framing that as "you cannot have our user's data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.
I agree; blocking aggressive crawlers that are badly behaved, etc, is what is sense. The files that are public are public and I should expect anyone who wants a copy can get them and do what they want with them.
I still maintain that since we already have this system (it's called "looking up your ISP and emailing them") where if you send spam emails, we contact your ISP, and you get kicked off the internet...
And the same system will also get you banned by your ISP if you port scan the Department of Defense...
why are we not doing the same thing against DoS attackers? Why are ISPs not hesitant to cut people off based on spam mail, but they won't do it based on DoS?
[dead]
From the Anubis docs
> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.
This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.
My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".
Just run the thing single threaded if you have to.
>My browser is supported by your website if it implements all the things your website needs.
Well, I guess your browser does not support everything needed? Being able to run a multi-threaded proof of work is not the same as checking arbitrary user agents; any browser can implement it.
By default, Anubis will not block Lynx and some other browsers that do not implement JavaScript, but it will block scrapers that claim to be Mozilla-based browsers; many of the badly behaving ones do claim to be Mozilla-based, so this helps. (I do not have a browser compatible with the Anubis software, and Anubis does not bother me.)
If necessary, it would also be possible to do what powxy does, which is display an explanation of how the proof of work works, in case you want to implement your own.
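The core idea behind such a proof-of-work challenge fits in a few lines. This is a minimal single-threaded sketch, not Anubis's or powxy's actual scheme; the function names and the "leading hex zeroes" difficulty encoding are my own assumptions for illustration:

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce such that sha256(challenge + nonce)
    starts with `difficulty` hex zeroes. Cost grows ~16x per unit."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, regardless of client effort."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve_pow("example-challenge", 3)
print(verify_pow("example-challenge", nonce, 3))  # True
```

The asymmetry is the whole point: the server issues a random challenge, verifies in one hash, and the client (or scraper) has to burn CPU proportional to the difficulty before it gets a session token.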
Offering alternative access to some files over other protocols might help, too.
So any bot author that reads your comment and switches their user agents to curl can now bypass anubis? That doesn't seem very well thought-out to me.
Patches are welcome!
Yeah, I don't like this.
We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.
In fact in light of the recent Github discussion I'd say I personally see this as a reason to avoid sourcehut. Sorry, but I want all the visibility I can get.
I was surprised to not see the "/s" at the end.
Big-Tech deciding that all our work belongs to them: Good
Small Code hosting platform does not want to be farmed like a Field of Corn: Bad
Why would you expect a /s?
I understand their standpoint: it's their infrastructure, and their bills.
However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.
Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.
> I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one
SourceHut is doing exactly how I expect them to act, and that's a good thing.
What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
> What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
I expect them (especially if they charge for it) to work in my interests as much as possible. Sure, defending themselves against abuse is fine. They have to survive to keep providing service.
However I don't appreciate the imposition of their own philosophy to something that isn't theirs. This here:
Is not okay with me.

What do you suggest they do? Or is it just the political position that's the problem? The result is the same: pretty much every single AI company is abusing sourcehut.
They have to do something, because I pay for a service, and if I can't use it, I'm not paying in the future. If that means blocking the AI companies that's fine, they can contact me if they want to use my code, we'll figure something out.
I expect hosts to be neutral to the maximum possible extent.
For example I expect a host not to have an arbitrary beef with Bing or Kagi, or to refuse to allow connections from France. Blocking can of course be rarely necessary, but what I want from a host is a blocking policy as minimal and selective as possible.
Yes, I understand it's a lot of work and is quite inconvenient, but especially if I'm paying for a service, I'm interested in my interests, not in what's convenient for the host.
> but especially if I'm paying for a service
I don't believe you are paying for Sourcehut hosting, so why do you care?
For that matter, "This has been part of our terms of service since they were originally written in 2018" so even if you are paying for hosting, why did you start using their services in the first place?
I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
My beliefs, for example, say that I shouldn't use services from companies which build tools to support an apartheid state (eg, https://www.theverge.com/news/643670/microsoft-employee-prot... ), nor from companies which host those projects.
Even if being neutral were more profitable for them and cheaper for me.
> I don't believe you are paying for Sourcehut hosting, so why do you care?
I theoretically could, and it's posted here I imagine to discuss the linked post. So I am.
> For that matter, "This has been part of our terms of service since they were originally written in 2018"
The "No AI" bit seems to show up only in late 2024. Which I'd regard as an extremely unwelcome development had I been paying.
> I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
Likewise. In my case, my belief is that when you pay somebody, it's to get things done your way. So for instance I'd be a lot more pleased with a setting.
> I expect hosts to be neutral to the maximum possible extent.
Why? Who gets to say what's acceptable?
You don't get any attribution, so how can you tell if they're using it or not? This seems like a philosophical argument instead of a technical one.
What attribution are you talking about? Here's what I mean:
You can go to ChatGPT and ask it: "please write a Python script that prints "Hello world" in red". And that works.
And you can also go to ChatGPT and ask it: "Please write an Overte script that makes an object red when it's clicked".
And I really like that this works. I certainly don't want it to stop working because Sourcehut has something against LLMs.
I don't think the use cases you're describing are what any critics are talking about.
How do you feel about someone with more funding than you going to an LLM and saying, "Reimplement the entire Overte source for me, but change it superficially so that Overte has a hard time suing me for stealing their IP?"
We're Apache licensed. An LLM seems like overkill.
I see; I encountered something similar with a DSL. For my use case, I had better results having an LLM scrape a well-formed doc reference page than a source code repo. I'd assume the same behavior extends to training data.
Oh, I'm sure there's all sorts of practical considerations regarding optimal LLM training.
All the same though, I don't like my host being so opinionated. I don't want a host that has something against any of the common search engines, and I don't want a host that has something against LLMs. Hosts should be as neutral as possible.
The code might not be theirs but the service hosting the code is and nothing is stopping you from hosting your code elsewhere. For some people blocking LLMs might be a reason to use sourcehut over github.
This would be fine in an ideal world. However, the one we live in has crawlers that don't care how many resources they use. They're fine with taking the server down or bankrupting the owner as long as they get the data they want.
And I can understand the abuse argument, however they have a blanket exclusion for AI I do not agree with.
>We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
That's not what the Apache license says.
According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.
I pay for sourcehut hosting, and I have no problems at all with this decision.
Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.
You won't get visibility from AI.
I'm curious what your project is. Blockchain?
VR platform. We're actually opposed to any blockchain tech as an organization. We're into it for the users.
But I see no reason to have any issues with LLMs. ChatGPT/copilot/etc helping new people getting started? That sounds absolutely great to me.
[flagged]
We're effectively open source VR chat, if you want to get an idea of what it's like.
Crypto-focused projects of this kind include Decentraland. Which tend to devolve into things like selling virtual land. That's not our jam and doesn't align with the way the project works anyway -- you can have all the land you want, and set things up for free on your own server.
I don't have any more issues with ChatGPT than I have with Google and Kagi. All of those are closed source projects. But by all means, I love open source, so if an open source LLM can do something like writing code for our platform, that'd be wonderful.
I see, now it makes sense for me with that context
> Sorry, but I want all the visibility I can get.
I can understand that, but the various AI companies pounding sourcehut into the ground also results in zero visibility.
Anubis has had great results blocking LLM agents https://anubis.techaro.lol/
That's what sourcehut is using.
As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).
A screenshot for anyone who hasn't seen it: https://i.imgur.com/dHOmHtn.png
(this screen appears only very briefly, so while it is clear what it is from a static screenshot, it's very hard to tell in real time)
Yes, this is explained and linked in the first sentence of the linked article.
Looks like git diffs are the new gold for training LLMs : https://carper.ai/diff-models-a-new-way-to-edit-code/
The solution is to make Git fully self-contained and encrypted, just like Fossil[1] - store issues and PRs inside the repository itself, truly distributed system.
[1] https://fossil-scm.org/
> a racketeer like CloudFlare
Could anyone teach me what makes this a fair characterization of Cloudflare?
Not sure exactly what it is referring to, but my guess is that it's because Cloudflare sells LLM inference as a service, but also a service that blocks LLMs. A bit like an anti-DDoS company also selling DDoS services.
For example, https://developers.cloudflare.com/workers-ai/guides/demos-ar... has examples visit websites, then for the people on the other side (who want to protect themselves against those visits) there is https://developers.cloudflare.com/waf/detections/firewall-fo...
Just a guess though, I don't know the author's intentions/meaning for sure.
> A bit like an anti-DDoS company also selling DDoS services.
That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.
Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:
"Just a guess though, I don't know for sure the authors intentions/meanings."
I am reminded of this posting from years past:
https://news.ycombinator.com/item?id=38496499
"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."
I've not spent much time on this topic but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed cloudflare behavior and that third party was sued ?
I'd like to learn more about that situation ...
I remember that when the first influx of LLM crawlers hit Sourcehut, they had some talks with Cloudflare, which ended when CF demanded an outrageous amount of money from a company the size of Sourcehut. If I find the source for this, I'll update.
[edit] Here's the source: https://sourcehut.org/blog/2024-01-19-outage-post-mortem/#:~...
Cloudflare has been accused of playing both sides--they host services for known/associated DDoS providers while conveniently offering services to protect DDoS.
How about using this to contribute an absolutely tiny amount of hashes to a mining pool on behalf of the website owner, instead of just burning the energy?
> All features work without JavaScript
Maybe they should update their bullet points...
The footnote saying "fuck you now, maybe come back later" is really encouraging.
What I don't understand is why these scrapers so aggressively scrape websites which barely change. How much value is OpenAI etc getting from hammering the ever-living shit out of my website thousands of times a day when the content changes weekly at most? I truly don't understand the tactic. Surely their resources are better spent elsewhere?
How sure are we that they're actually LLM scrapers and not just someone trying to DDoS source hut with plausible deniability?
The LLM scrapers could publish the ip ranges they use for scraping like google does, but that would make it easier to block them so they probably wouldn't do that.
https://developers.google.com/search/docs/crawling-indexing/...
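If crawlers did publish their ranges, checking a request's source address against them would be straightforward. A hypothetical sketch using Python's `ipaddress` module; the CIDR blocks below are RFC 5737 documentation ranges standing in for a vendor's list, not anyone's real crawler ranges:

```python
import ipaddress

# Hypothetical published crawler ranges (illustrative only;
# these are reserved documentation networks, not real crawler IPs).
CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2
]

def is_known_crawler(addr: str) -> bool:
    """Return True if the address falls inside any published crawler range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in CRAWLER_RANGES)

print(is_known_crawler("192.0.2.15"))   # True
print(is_known_crawler("203.0.113.9"))  # False
```

That check is cheap enough to run per request at the edge, which is exactly why publishing ranges would make blocking trivial.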
[flagged]
Use of published information is still always constrained by copyright law. If I had a copyrighted movie playing on my television visible through the window, and you recorded that, redistributing that recording would unambiguously be a violation of copyright law and piracy.
I’m also a little confused by what you’re saying here; are you asking whether scraper bots are illegal, or whether they’re immoral/unethical?
Looking through your window is already covered by a lot of laws (it's legal sometimes in some places if you didn't take reasonable effort to prevent it [like closing the blinds], and as long as there was no trespass). If I captured a small enough section of a video - say, one frame in a photo - that likely is fair use. It is not crystal clear.
I’m getting to the ethical aspect but also trying to be pragmatic. “Publishing a bunch of information on the internet accessible without authentication” is an action that is fundamentally incompatible with controlling the use of that information.
The law cannot substitute for common sense; criminals are gonna crime.
Your drone makes a loud buzzing sound, and blocks the street for anyone else trying to get in, and does not move on its own. This is where it escalates from "Taking advantage of public information" to "Harassment".
Crawling source hut once is public information. Crawling it once a day using deltas might still be that. What these AI companies are doing is not that.
Pretending that published data isn’t public is a fool’s errand.
The point of a web host is to serve the users’ data to the public.
Anything else means the web host is broken.
I feel like we're part of a dying generation or something. I keep seeing people who want to post content to the public internet, but they still want to own the data somehow, and control who sees it, but still on the public internet.
I'm not sure how it's supposed to work, as I see the public internet as just that, a public square. What goes there can be picked up by anyone, for any purpose. If I want something to be secret, I don't put it on the public internet.
Gonna be interesting to see how that "public but in my control" movement continues to evolve, because it feels like they're climbing an impossible wall.
> I'm not sure how it's supposed to work
Laws.
This one is called copyright.
> Gonna be interesting to see how that "public but in my control" movement continues to evolve
Berne convention is almost 150 years old now...
Yes, laws work just great over the global public internet.
You have to actually utilize the legal system. Others suggest attaching licenses to their published content/data preventing AI training, but attaching the license alone does nothing. You have to actually drag someone to court once in a while for it to work.
It's the same complaint about the GDPR: if it works, why are sites still doing X/Y/Z... Well, because all people do is complain online; you need to report violations and be prepared to take legal action.
As someone who ran SMTP systems for years: legal complaints only work against people in your own country and the countries it cooperates with.
"Dear Russia, pwetty pweeese don't hammer my server to death stealing everything in sight"
[Crickets]
Russian IP ban
I get what you're saying, and I do question the value of suing someone in Russia or China, but did you actually get a lawyer to file an actual real lawsuit in Russia? Again, Russia: probably not really going to work.
There's absolutely no reason why you couldn't drag OpenAI to court... you'd need a ton of money, but you could and if you win, then the rest of the AI companies are going to get very busy adjusting their behaviour.
In the music biz, copyright is sometimes used to prevent very specific uses of material. Recent examples include a number of musicians (or perhaps their labels or publishing houses) denying the use of certain pieces of music at rallies for particular politicians.
IANAL, but my read of this is that if the content has the appropriate licence, the licence holder can withhold certain rights & access from certain groups of potential licensees. I'm loosely aware that the common open source licences are highly permissive, so probably they can't be used in this way... but presumably not all licences are like that. So, even though the work is "public", it should still be possible to enforce subsets of rights.
And to take your "public square" analogy... before we had cameras everywhere, there was some expectation of "casual" privacy even in public spaces. Not everyone in the square hears everything said by everyone else in the square. The fact that digital tools make privacy breaches much easier doesn't mean it should be tolerated.
(that said, I'm fairly careful what I publish online)
https://en.wikipedia.org/wiki/Tragedy_of_the_commons ; people could be nicer about their use of public spaces / resources.
Though that also goes towards those posting content. Systems that generate an infinite number of permutations for viewing the same information are a poor design. It can easily lead to even conscientious people discovering that a simple attempt to slowly mirror a website overnight with wget has resulted in some rabbit hole of forum view parameter explosion.
The public internet is a public square, but does it have to be some caricature 1970's Times Square dominated by dealers, pimps, and thugs all looking to extract whatever they can from me?
I'm one of those people. I think it comes down to the intent. There is an implicit good will toward those who do this that the data isn't abused or the infrastructure behind it overwhelmed (self-hosting). "Big Tech" just makes this worse, because their motivations aren't the same as ours (small web).
Once all devices are locked down and all control has been taken away from the users, we can finally have functional DRM on every device and make this dream come true.
> I feel like we're part of a dying generation or something
Well... yes. Just like every living thing that's ever existed ;)
> public but in my control
I think we have a word for that - copyright
But even copyright doesn't give you "control" over something. I can still download it and use it however I want, privately; I'm just legally limited in further distributing it. Unless I change it sufficiently, then it's again OK to distribute my changed version. Of course, depending on country and whatnot.
Problem remains the same: as soon as the content is visible on the public internet, you lose 100% control of its propagation, illegal or not.
> I can still download it and use it however I want, privately, just that I'm limited legally to further distribute it
the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
it’s just moving the ‘control’ up a level to a different party. it’s still there.
> as soon as the content is visible on the public internet, you lose 100% control of it's propagation, illegal or not.
GPL license: thou shall always* provide access to this software’s source code —> form of control in the opposite direction to music copyright, but still a form of control
* subject to terms, conditions, locale and other legal things (IANAL)
> the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
That's why every social media site in existence puts terms in its EULA demanding that users grant the site a blanket license to redistribute their content, over and above any separate licenses they may put on it. After it's been redistributed to third parties, the copyright holder has no more control (at least, not via copyright law) over how those copies are privately used.
E.g., on HN: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, 'User Content'), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
Have to agree. If you want to limit who sees your content, put it behind a paywall or some sort of subscription.
For Anubis it makes sense.
If you post something on a physical bulletin board, you expect people will come by and read it
If a bunch of "scrapers" come by and form a massive mob so that nobody else can read it, and then cover the board with slightly different copies attributed to themselves, that isn't exactly the "public square" you imagined
> If you post something on a physical bulletin board, you expect people will come by and read it
Or birds, or public cameras, or people who take photos next to it, or people archiving bulletin boards, or...
If I put stuff in public spaces, I expect anyone and anything to be able to read, access and store it. Basically how I treat this very comment, and everything else I put on the public internet.
Eh, the internet is a weirdly semi-public space. Think of it more like a mall parking lot than a public common. If you put something up for notice there it will probably be fine. For example a number of grocery stores around me have boards for things just like that. But the moment it becomes a public nuisance for them they will trespass your ass outta there faster than a starved dog would eat a dropped hotdog.
As you say, it's not public as in public road (and even that has police to enforce proper behavior), but more like publicly accessible but on private properties.
When it's my server and I'm paying for it, then banning resource wasters is the right move.
You can’t have your cake and eat it, too. The power of the internet and mass media is that you can publish to the whole world, and make something public to billions of people. With that power comes side effects, such as, obviously, billions of people being able to privately do whatever they want with the information you, you know, published.
“public, but, no, not like that” isn’t a thing and no technological measure can make it a thing.
I don't know, this is like saying that because those bowls of mints at a restaurant are "free", I can back a trailer up to the door and start loading it up. Even if you know they'll never run out of mints.
I feel like you need to present a very strong case where LLMs are "the public" before you take such a weak position when interpreting the entirety of the article.
Drew makes it perfectly clear in TFA that "the public", as he sees it, is fully entitled and should make use of the data SourceHut provides.
LLMs are just tools, run by human beings who are naturally members of the public. There is no confusion or ambiguity here.
So are DDoS scripts
But one can argue that the LLM crawlers deny the rest of the public access to your data by consuming all available bandwidth.
If the AI scrapers respected the robots.txt file then this wouldn't be an issue. A company is allowed to set the terms of service for their service and take action if other companies are abusing that.
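For reference, the opt-out itself is technically trivial; the whole problem is that compliance is voluntary. A sketch of the kind of rules involved — the user-agent tokens below are ones various AI companies have documented for their crawlers, but any such list is illustrative and goes stale quickly:

```
# robots.txt: disallow some documented AI-training crawlers
# (illustrative, not exhaustive)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else: crawl normally
User-agent: *
Allow: /
```

A well-behaved crawler fetches this file and skips the site entirely; the abusive ones either never request it or ignore it, which is what pushes hosts toward active countermeasures.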
Opening a bakery and feeding the entire world aren't the same.
Then put your content behind a paywall. Bakeries typically aren't free.