On the topic of licenses and LLMs: of course, we have to applaud SourceHut for at least trying not to allow all their code to be ingested by some mechanical license-violation service. But it seems like a hard game. Ultimately the job of their site is to serve code, so they can only be so restrictive.
I wonder if anyone has tried going in the opposite direction? Something like adding to their license: “by training a machine learning model on this source code, or including data crawled from this site, you agree that your model is free for all to use, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese.) I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.
If you squint at the GPL, you could argue that every LLM is already under it, because it's a viral license and there's almost certainly some GPL code in there somewhere. I'm sure the AI companies would beg to differ, though: they want a one-way street where there are zero restrictions on IP going into models, but they can dictate whatever restrictions they like on the resulting model, derived models, and model output.
I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.
I think the courts were pretty clear: prove damages. I'm not saying I agree in any capacity, but the AI companies went to court, and it appears they've already won.
The hope is that we can put the onus on them to start suing people or whatever, at least. The US legal system is biased toward whoever has the biggest budget, of course, but the defense still gets a little bit of an advantage as well.
I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.
There have been copyright office rulings saying that ML model output is not copyrightable, so that last part of the suggested license seems a bit strange, since the rulings could preclude it for code at some point.
Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.
How can AI code be added to any kind of open source license, or would it just be code that isn't covered under the license (since it's effectively public domain)?
In most cases it would just be added like all other code as if it was proper/allowed until someone brought a civil suit with enough evidence to claim they had no permission to use the code AND it somehow damaged them, which could be quite difficult and prohibitively expensive.
So I would argue in most cases people will get away with it. We must remember that the only person's opinion that matters on what's actually illegal or not is a judge's.
> There have been copyright office rulings saying that ML model output is not copyrightable
Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
If someone had instead argued that they themselves were the author by means of instructing an LLM to generate the output based on a prompt they made up, plus the influence of prior art (the pretrained data), which you could also argue is not only what he probably did anyway (but argued it poorly), but in a way is also how humans make art themselves... that would be a much more interesting decision IMO.
Blocking aggressive crawlers - whether or not they have anything to do with AI - makes complete sense to me. There are growing numbers of badly implemented crawlers out there that can rack up thousands of dollars in bandwidth expenses for sites like SourceHut.
Framing that as "you cannot have our users' data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.
I agree; blocking aggressive crawlers that are badly behaved, etc., makes sense. The files that are public are public, and I should expect anyone who wants a copy can get them and do what they want with them.
I still maintain that since we already have this system (it's called "looking up your ISP and emailing them") where if you send spam emails, we contact your ISP, and you get kicked off the internet...
And the same system will also get you banned from your ISP if you port scan the Department of Defense...
why are we not doing the same thing against DoS attackers? Why are ISPs willing to cut people off over spam mail, but not over DoS?
From the Anubis docs:
> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.
This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.
My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".
Just run the thing single threaded if you have to.
>My browser is supported by your website if it implements all the things your website needs.
Well, I guess your browser does not support everything needed? Being able to run a multi-threaded proof of work is not the same as checking arbitrary user agents; any browser can implement it.
By default, Anubis will not block Lynx and some other browsers that do not implement JavaScript, but it will block scrapers that claim to be Mozilla-based browsers; many of the badly behaving ones do claim to be Mozilla-based, so this helps. (I do not have a browser compatible with the Anubis software, and Anubis does not bother me.)
If necessary, it would also be possible to do what powxy does, which is display an explanation of how the proof of work works, in case you want to implement your own.
Having alternative access to some files via other protocols might also help.
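As an aside for anyone who does want to implement their own check: the general shape of a hash-based proof of work is simple to sketch. This is a generic hash-puzzle illustration, not Anubis's actual algorithm; the function names and the difficulty encoding are mine:

```python
import hashlib

def solve_pow(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce so that sha256(challenge + nonce)
    starts with `difficulty` zero hex digits. Work grows ~16x per digit."""
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so verification is cheap even though
    solving was expensive for the client."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the whole point: solving costs the client many hashes, verifying costs the server one, so hammering a site millions of times stops being free.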
We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.
In fact in light of the recent Github discussion I'd say I personally see this as a reason to avoid sourcehut. Sorry, but I want all the visibility I can get.
I understand their standpoint: it's their infrastructure, and their bills.
However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.
Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.
> What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
I expect them (especially if they charge for it) to work in my interests as much as possible. Sure, defending themselves against abuse is fine. They have to survive to keep providing service.
However I don't appreciate the imposition of their own philosophy on something that isn't theirs. This here:
# Disallowed:
# [...]
# - Anything used to feed a machine learning model
What do you suggest they do? Or is it just the political position that's the problem? The result is the same: pretty much every single AI company is abusing SourceHut.
They have to do something, because I pay for a service, and if I can't use it, I'm not paying in the future. If that means blocking the AI companies that's fine, they can contact me if they want to use my code, we'll figure something out.
I expect hosts to be neutral to the maximum possible extent.
For example, I expect a host not to have an arbitrary beef with Bing or Kagi, or to refuse to allow connections from France. Blocking can of course occasionally be necessary, but what I want from a host is a blocking policy that is as minimal and selective as possible.
Yes, I understand it's a lot of work and is quite inconvenient, but especially if I'm paying for a service, I'm interested in my interests, not in what's convenient for the host.
I don't believe you are paying for Sourcehut hosting, so why do you care?
For that matter, "This has been part of our terms of service since they were originally written in 2018" so even if you are paying for hosting, why did you start using their services in the first place?
I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
> I don't believe you are paying for Sourcehut hosting, so why do you care?
I theoretically could, and I imagine it's posted here to discuss the linked post. So I am.
> For that matter, "This has been part of our terms of service since they were originally written in 2018"
The "No AI" bit seems to show up only in late 2024. Which I'd regard as an extremely unwelcome development had I been paying.
> I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
Likewise. In my case, my belief is that when you pay somebody, it's to get things done your way. So, for instance, I'd be a lot more pleased with a setting.
I don't think the use cases you're describing are what any critics are talking about.
How do you feel about someone with more funding than you going to an LLM and saying, "Reimplement the entire Overte source for me, but change it superficially so that Overte has a hard time suing me for stealing their IP?"
I see; I encountered something similar with a DSL. For my use case, I had better results by having an LLM scrape a well-formed doc reference page than a source code repo. I'd assume that same behavior extends to training data.
Oh, I'm sure there's all sorts of practical considerations regarding optimal LLM training.
All the same though, I don't like my host being so opinionated. I don't want a host that has something against any of the common search engines, and I don't want a host that has something against LLMs. Hosts should be as neutral as possible.
The code might not be theirs but the service hosting the code is and nothing is stopping you from hosting your code elsewhere. For some people blocking LLMs might be a reason to use sourcehut over github.
This would be fine in an ideal world. However, the one we live in has crawlers that don't care how many resources they use. They're fine with taking the server down or bankrupting the owner as long as they get the data they want.
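A middle ground between serving everyone and blanket blocking is per-client rate limiting: a polite clone sails through, while a hammering crawler gets throttled. A minimal token-bucket sketch (the rate and capacity numbers are tuning choices for illustration, not anything SourceHut actually uses):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per client IP (or per network block) and answer 429 whenever `allow()` returns False.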
>We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
That's not what the Apache license says.
According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.
I pay for sourcehut hosting, and I have no problems at all with this decision.
Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.
We're effectively open source VR chat, if you want to get an idea of what it's like.
Crypto-focused projects of this kind include Decentraland. Which tend to devolve into things like selling virtual land. That's not our jam and doesn't align with the way the project works anyway -- you can have all the land you want, and set things up for free on your own server.
I don't have any more issues with ChatGPT than I have with Google and Kagi. All of those are closed source projects. But by all means, I love open source, so if an open source LLM can do something like writing code for our platform, that'd be wonderful.
As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).
The solution is to make Git fully self-contained and encrypted, just like Fossil[1] - store issues and PRs inside the repository itself, truly distributed system.
Not sure exactly what it is referring to, but I could make a guess that it's because Cloudflare sells LLM inference as a service, but also a service that blocks LLMs. A bit like an anti-DDoS company also selling DDoS services.
> A bit like an anti-DDoS company also selling DDoS services.
That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.
Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:
mrstresser.com. 21600 IN NS sterling.ns.cloudflare.com.
silentstress.cc. 21472 IN NS ernest.ns.cloudflare.com.
maxstresser.com. 21600 IN NS edna.ns.cloudflare.com.
darkvr.su. 21600 IN NS paige.ns.cloudflare.com.
stresser.sh. 21600 IN NS luke.ns.cloudflare.com.
stresserhub.org. 21600 IN NS fay.ns.cloudflare.com.
"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."
I've not spent much time on this topic, but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed Cloudflare behavior, and that third party was sued?
I remember that when the first influx of LLM crawlers hit Sourcehut, they had some talks with Cloudflare, which ended when CF demanded an outrageous amount of money from a company the size of Sourcehut. If I find the source for this, I'll update.
Cloudflare has been accused of playing both sides: they host services for known or associated DDoS providers while conveniently offering services to protect against DDoS.
How about using this to contribute an absolutely tiny amount of hashes to a mining pool on behalf of the website owner, instead of just burning the energy?
What I don't understand is why these scrapers so aggressively scrape websites which barely change. How much value is OpenAI etc getting from hammering the ever-living shit out of my website thousands of times a day when the content changes weekly at most? I truly don't understand the tactic. Surely their resources are better spent elsewhere?
The LLM scrapers could publish the ip ranges they use for scraping like google does, but that would make it easier to block them so they probably wouldn't do that.
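For comparison, Google also supports verification that doesn't depend solely on published ranges: reverse DNS on the requesting IP, check the hostname falls under googlebot.com or google.com, then forward-confirm that the name resolves back to the same IP. A rough sketch (function names are mine, and the two lookups obviously need network access):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def name_is_google(host: str) -> bool:
    """Pure string check: is the reverse-DNS name under Google's domains?"""
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_verified_googlebot(ip: str) -> bool:
    """Reverse lookup, suffix check, then forward-confirm the IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False
    if not name_is_google(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

The forward confirmation matters: anyone who controls reverse DNS for their own IPs can make them claim to be `crawl-x.googlebot.com`, but they can't make Google's forward zone answer for that name.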
Use of published information is still always constrained by copyright law. If I had a copyrighted movie playing on my television visible through the window, and you recorded that, redistributing that recording would unambiguously be a violation of copyright law and piracy.
I’m also a little confused by what you’re saying here; are you asking whether scraper bots are illegal, or whether they’re immoral/unethical?
Looking through your window is already covered by a lot of laws (it's legal sometimes in some places if you didn't take reasonable effort to prevent it [like closing the blinds], and as long as there was no trespass). If I captured a small enough section of a video - say, one frame in a photo - that likely is fair use. It is not crystal clear.
I’m getting to the ethical aspect but also trying to be pragmatic. “Publishing a bunch of information on the internet accessible without authentication” is an action that is fundamentally incompatible with controlling the use of that information.
The law cannot substitute for common sense; criminals are gonna crime.
Your drone makes a loud buzzing sound, and blocks the street for anyone else trying to get in, and does not move on its own. This is where it escalates from "Taking advantage of public information" to "Harassment".
Crawling SourceHut once is taking advantage of public information. Crawling it once a day using deltas might still be that. What these AI companies are doing is not that.
I feel like we're part of a dying generation or something. I keep seeing people who want to post content to the public internet, but they still want to own the data somehow, and control who sees it, but still on the public internet.
I'm not sure how it's supposed to work, as I see the public internet as just that, a public square. What goes there can be picked up by anyone, for any purpose. If I want something to be secret, I don't put it on the public internet.
Gonna be interesting to see how that "public but in my control" movement continues to evolve, because it feels like they're climbing an impossible wall.
You have to actually utilize the legal system. Others suggest attaching licenses to their published content/data preventing AI training, but attaching the license alone does nothing. You have to actually drag someone to court once in a while for it to work.
It's the same complaint about the GDPR: if it works, why are sites still doing X/Y/Z... Well, because all people do is complain online; you need to report violations and be prepared to take legal action.
I get what you're saying, and I do question the value of suing someone in Russia or China, but did you actually get a lawyer to file an actual real lawsuit in Russia? Again, Russia: probably not really going to work.
There's absolutely no reason why you couldn't drag OpenAI to court... you'd need a ton of money, but you could and if you win, then the rest of the AI companies are going to get very busy adjusting their behaviour.
In the music biz, copyright is sometimes used to prevent very specific uses of material. Recent examples include a number of musicians (or perhaps their labels or publishing houses) denying the use of certain pieces of music at rallies for particular politicians.
IANAL, but my read of this is that if the content has the appropriate licence, the licence holder can withhold certain rights & access from certain groups of potential licensees. I'm loosely aware that the common open source licences are highly permissive, so probably they can't be used in this way... but presumably not all licences are like that. So, even though the work is "public", it should still be possible to enforce subsets of rights.
And to take your "public square" analogy... before we had cameras everywhere, there was some expectation of "casual" privacy even in public spaces. Not everyone in the square hears everything said by everyone else in the square. The fact that digital tools make privacy breaches much easier doesn't mean it should be tolerated.
(that said, I'm fairly careful what I publish online)
Though that also goes towards those posting content. Systems that generate an infinite number of permutations for viewing the same information are a poor design. It can easily lead to even conscientious people discovering that a simple attempt to slowly mirror a website overnight with wget has resulted in some rabbit hole of forum view parameter explosion.
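One mitigation on the publishing side is to canonicalize away presentation-only query parameters, so a crawler (or a polite wget mirror) sees one URL per page instead of every sort/view permutation. A sketch; the noise-parameter list is hypothetical and would be site-specific:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that change presentation, not content:
NOISE_PARAMS = {"sort", "view", "theme", "sid"}

def canonicalize(url: str) -> str:
    """Drop noise parameters and sort the rest, so equivalent views
    of the same page collapse to a single URL."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in NOISE_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Advertising the result via `rel="canonical"` (or redirecting to it) is one way to hint well-behaved crawlers out of the permutation rabbit hole.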
The public internet is a public square, but does it have to be some caricature 1970's Times Square dominated by dealers, pimps, and thugs all looking to extract whatever they can from me?
I'm one of those people. I think it comes down to the intent. There is an implicit good will of those that do this that the data isn't abused or the infrastructure behind it overwhelmed (self-hosting). "Big Tech" just make this worse, because their motivations aren't the same as ours (small web).
Once all devices are locked down and all control has been taken away from the users, we can finally have functional DRM on every device and make this dream come true.
> I feel like we're part of a dying generation or something
Well... yes. Just like every living thing that's ever existed ;)
But even copyright doesn't give you "control" over something. I can still download it and use it however I want, privately, just that I'm limited legally to further distribute it. Unless I change it sufficiently, then it's again OK to distribute my changed one. Of course, depending on country and whatnot.
Problem remains the same: as soon as the content is visible on the public internet, you lose 100% control of its propagation, illegal or not.
> I can still download it and use it however I want, privately, just that I'm limited legally to further distribute it
the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
it’s just moving the ‘control’ up a level to a different party. it’s still there.
> as soon as the content is visible on the public internet, you lose 100% control of its propagation, illegal or not.
GPL license: thou shall always* provide access to this software’s source code —> form of control in the opposite direction to music copyright, but still a form of control
* subject to terms, conditions, locale and other legal things (IANAL)
> the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
That's why every social media site in existence puts terms in its EULA demanding that users grant the site a blanket license to redistribute their content, over and above any separate licenses they may put on it. After it's been redistributed to third parties, the copyright holder has no more control (at least, not via copyright law) over how those copies are privately used.
E.g., on HN: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, 'User Content'), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
If you post something on a physical bulletin board, you expect people will come by and read it
If a bunch of "scrapers" come by and form a massive mob so that nobody else can read it, and then cover the board with slightly different copies attributed to themselves, that isn't exactly the "public square" you imagined
> If you post something on a physical bulletin board, you expect people will come by and read it
Or birds, or public cameras, or people who take photos next to it, or people archiving bulletin boards, or...
If I put stuff in public spaces, I expect anyone and anything to be able to read, access and store it. Basically how I treat this very comment, and everything else I put on the public internet.
Eh, the internet is a weirdly semi-public space. Think of it more like a mall parking lot than a public common. If you put something up for notice there it will probably be fine. For example a number of grocery stores around me have boards for things just like that. But the moment it becomes a public nuisance for them they will trespass your ass outta there faster than a starved dog would eat a dropped hotdog.
As you say, it's not public as in public road (and even that has police to enforce proper behavior), but more like publicly accessible but on private properties.
When it's my server and I'm paying for it, then banning resource wasters is the right move.
You can’t have your cake and eat it, too. The power of the internet and mass media is that you can publish to the whole world, and make something public to billions of people. With that power comes side effects, such as, obviously, billions of people being able to privately do whatever they want with the information you, you know, published.
“public, but, no, not like that” isn’t a thing and no technological measure can make it a thing.
I don't know, this is like saying that because those bowls of mints at a restaurant are "free", I can back a trailer up to the door and start loading it up. Even if you know they'll never run out of mints.
I feel like you need to present a very strong case where LLMs are "the public" before you take such a weak position when interpreting the entirety of the article.
Drew makes it perfectly clear in TFA that "the public", as he sees it, is fully entitled and should make use of the data SourceHut provides.
If the AI scrapers respected the robots.txt file then this wouldn't be an issue. A company is allowed to set the terms of service for their service and take action if other companies are abusing that.
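Checking behavior against robots.txt is entirely mechanical; Python's standard library even ships a parser. A small sketch (the robots.txt content below is illustrative, not SourceHut's actual file):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt in the spirit of blocking an AI crawler:
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.sr.ht/~user/project"))      # False
print(rp.can_fetch("SomeBrowser", "https://example.sr.ht/~user/project")) # True
```

The point stands either way: parsing and honoring the file is trivial, so when a scraper ignores it, that is a choice, not a technical limitation.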
On the topic of licenses and LLM’s—of course, we have to applaud sourcehut at least trying to not allow all their code to be ingested by some mechanical license violation service. But, it seems like a hard game. Ultimately the job of their site is to serve code, so they can only be so restrictive.
I wonder if anyone has tried going in the opposite direction? Someone like adding to their license: “by training a machine learning algorithm trained on this source code or including data crawled from this site, you agree that your model is free to use by all, will be openly distributed, and any output generated by the model is licensed under open source terms.” (But, ya know, in bulletproof legalese). I guess most of these thieves won’t respect the bit about distributing. But at least if the model leaks or whatever, the open source community can feel free to use it without any moral conflict or legal stress.
If you squint at the GPL then you could argue that every LLM is already under it, because it's a viral license and there's almost certainly some GPL code in there somewhere. I'm sure the AI companies would beg to differ though, they want a one-way street where there's zero restrictions on IP going into models, but they can dictate whatever restrictions they like on the resulting model, derived models, and model output.
I hope one of the big proprietary models leaks one day so we get to see OpenAI or Google tie themselves in knots to argue that training on libgen is fine, but distilling a leaked copy of GPT or Gemini warrants death by firing squad.
I think the courts were pretty clear, prove damages. I'm not saying I agree in any capacity, but the AI companies went to court, and it appears they've already won.
The hope is that we can out the onus on them to start suing people or whatever, at least. The US legal system is biased toward whoever has the biggest budget of course, but the defense still gets a little bit of advantage as well.
> Someone like adding to their license
I would assume that clause would be unenforceable. They may be able to try to sue for violating the terms of the license, but I'm fairly confident they're not going to get a judge to order them to give their model away for free even if they won. And they would likely still need to show damages in order to win a contract case.
There have been copyright office rulings saying that ML model output is not copyrightable, so that last part of the suggested license seems a bit strange, since the rulings could preclude it for code at some point.
Also, it remains to be seen whether copyright law will outright allow model training without a license, or if there will be case law to make it fair use in the USA, or if models will be considered derivative works that require a license to prepare, or what other outcome will happen.
How can AI code be added to any kind of open source license, or would it just be that code that isn't covered under the license (since it's effectively public domain?)?
In most cases it would just be added like all other code as if it was proper/allowed until someone brought a civil suit with enough evidence to claim they had no permission to use the code AND it somehow damaged them, which could be quite difficult and prohibitively expensive.
So I would argue in most cases people will get away with it. We must remember that the only person's opinion that matters on what's actually illegal or not is a judge's.
It seems like AI companies are making a major bet on the idea that the output of their models can be licensed and used in products.
> There have been copyright office rulings saying that ML model output is not copyrightable
Source? If you're referring to Thaler v. Perlmutter, I would argue that is an incorrect assessment.
> The court held that the Copyright Act requires all eligible works to be authored by a human being. Since Dr. Thaler listed the Creativity Machine, a non-human entity, as the sole author, the application was correctly denied. The court did not address the argument that the Constitution requires human authorship, nor did it consider Dr. Thaler's claim that he is the author by virtue of creating and using the Creativity Machine, as this argument was waived before the agency.
If someone had instead argued that they themselves were the author by means of instructing an LLM to generate the output based on a prompt they made up, plus the influence of prior art (the pretrained data), which you could also argue is not only what he probably did anyway (but argued it poorly), but in a way is also how humans make art themselves... that would be a much more interesting decision IMO.
What does copyright have to do with it? It's about distribution.
Blocking aggressive crawlers - whether or not they have anything to do with AI - makes complete sense to me. There are growing numbers of badly implemented crawlers out there that can rack up thousands of dollars in bandwidth expenses for sites like SourceHut.
Framing that as "you cannot have our user's data" feels misleading to me, especially when they presumably still support anonymous "git clone" operations.
I agree; blocking aggressive crawlers that are badly behaved, etc, is what is sense. The files that are public are public and I should expect anyone who wants a copy can get them and do what they want with them.
I still maintain that since we already have this system (it's called "looking up your ISP and emailing them") where if you send spam emails, we contact your ISP, and you get kicked off the internet...
And the same system will also get you banned by your ISP if you port scan the Department of Defense...
why are we not doing the same thing against DoS attackers? Why are ISPs not hesitant to cut people off based on spam mail, but they won't do it based on DoS?
[dead]
From the Anubis docs
> Anubis uses a multi-threaded proof of work check to ensure that users browsers are up to date and support modern standards.
This is so not cool. Further gatekeeping websites from older browsers. That is absolutely not their call to make. My choice of browser version is entirely my decision. Web standards are already a change treadmill, this type of artificial "You must be at least Internet Explorer 11" or "this website works best in Chrome" nonsense makes it much worse.
My browser is supported by your website if it implements all the things your website needs. That is the correct test. Not: "Your User-Agent is looking at me funny!" or "The bot prevention system we chose has an arbitrary preference for particular browser versions".
Just run the thing single threaded if you have to.
>My browser is supported by your website if it implements all the things your website needs.
Well, I guess your browser does not support everything needed? Being able to run a multi-threaded proof of work is not the same as checking arbitrary user agents; any browser can implement it.
By default, Anubis will not block Lynx and some other browsers that do not implement JavaScript, but it will block scrapers that claim to be Mozilla-based browsers; many of the badly behaving ones do claim to be Mozilla-based, so this helps. (I do not have a browser compatible with the Anubis software, and Anubis does not bother me.)
If necessary, it would also be possible to do what powxy does, which is display an explanation of how the proof of work works, in case you want to implement your own.
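The core idea behind such a proof-of-work challenge fits in a few lines. This is a minimal single-threaded sketch, not Anubis's or powxy's actual scheme; the function names and the "leading hex zeroes" difficulty encoding are my own assumptions for illustration:

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce such that sha256(challenge + nonce)
    starts with `difficulty` hex zeroes. Cost grows ~16x per unit."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, regardless of client effort."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve_pow("example-challenge", 3)
print(verify_pow("example-challenge", nonce, 3))  # True
```

The asymmetry is the whole point: the server issues a random challenge, verifies in one hash, and the client (or scraper) has to burn CPU proportional to the difficulty before it gets a session token.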
Offering alternative access to some files over other protocols might help, too.
So any bot author that reads your comment and switches their user agents to curl can now bypass anubis? That doesn't seem very well thought-out to me.
Patches are welcome!
Yeah, I don't like this.
We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
In fact I'd say AI is overall a benefit to our project, because we have a large, quite complex platform, and the fact that ChatGPT actually manages to sometimes correctly write scripts for it is quite wonderful. I think it helps new people get started.
In fact in light of the recent Github discussion I'd say I personally see this as a reason to avoid sourcehut. Sorry, but I want all the visibility I can get.
I was surprised to not see the "/s" at the end.
Big-Tech deciding that all our work belongs to them: Good
Small Code hosting platform does not want to be farmed like a Field of Corn: Bad
Why would you expect a /s?
I understand their standpoint: it's their infrastructure, and their bills.
However my concerns are with my project, not with their infrastructure bills. I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one. I want ChatGPT/copilot/etc to know it exists and to write code for it, just in case that brings in more users.
Blocking abusive behavior? Sure. But I very specifically disagree with the blanket prohibition of "Anything used to feed a machine learning model". I do not see it being in my interest.
> I seek to maximize the prominence, usability and general success of my project, and as such I want it to have a presence everywhere it can have one
SourceHut is doing exactly how I expect them to act, and that's a good thing.
What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
> What did you expect from SourceHut, and why didn't you take this mindset off to GitHub in the first place?
I expect them (especially if they charge for it) to work in my interests as much as possible. Sure, defending themselves against abuse is fine. They have to survive to keep providing service.
However I don't appreciate the imposition of their own philosophy to something that isn't theirs. This here:
Is not okay with me.

What do you suggest they do? Or is it just the political position that's the problem? The result is the same: pretty much every single AI company is abusing sourcehut.
They have to do something, because I pay for a service, and if I can't use it, I'm not paying in the future. If that means blocking the AI companies that's fine, they can contact me if they want to use my code, we'll figure something out.
I expect hosts to be neutral to the maximum possible extent.
For example I expect a host not to have an arbitrary beef with Bing or Kagi, or to refuse to allow connections from France. Blocking can of course be rarely necessary, but what I want from a host is a blocking policy as minimal and selective as possible.
Yes, I understand it's a lot of work and is quite inconvenient, but especially if I'm paying for a service, I'm interested in my interests, not in what's convenient for the host.
> but especially if I'm paying for a service
I don't believe you are paying for Sourcehut hosting, so why do you care?
For that matter, "This has been part of our terms of service since they were originally written in 2018" so even if you are paying for hosting, why did you start using their services in the first place?
I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
My beliefs, for example, say that I shouldn't use services from companies which build tools to support an apartheid state (eg, https://www.theverge.com/news/643670/microsoft-employee-prot... ), nor from companies which host those projects.
Even if being neutral were more profitable for them and cheaper for me.
> I don't believe you are paying for Sourcehut hosting, so why do you care?
I theoretically could, and it's posted here I imagine to discuss the linked post. So I am.
> For that matter, "This has been part of our terms of service since they were originally written in 2018"
The "No AI" bit seems to show up only in late 2024. Which I'd regard as an extremely unwelcome development had I been paying.
> I don't expect hosts to be neutral to the maximum possible extent. I exercise my freedom of association to select hosts which are more aligned to my beliefs.
Likewise. In my case, my belief is that when you pay somebody, it's to get things done your way. So for instance I'd be a lot more pleased with a setting.
> I expect hosts to be neutral to the maximum possible extent.
Why? Who gets to say what's acceptable?
You don't get any attribution, so how can you tell if they're using it or not? This seems like a philosophical argument instead of a technical one.
What attribution are you talking about? Here's what I mean:
You can go to ChatGPT and ask it: "please write a Python script that prints "Hello world" in red". And that works.
And you can also go to ChatGPT and ask it: "Please write an Overte script that makes an object red when it's clicked".
And I really like that this works. I certainly don't want it to stop working because Sourcehut has something against LLMs.
I don't think the use cases you're describing are what any critics are talking about.
How do you feel about someone with more funding than you going to an LLM and saying, "Reimplement the entire Overte source for me, but change it superficially so that Overte has a hard time suing me for stealing their IP?"
We're Apache licensed. An LLM seems like overkill.
I see; I encountered something similar with a DSL. For my use case, I had better results having an LLM scrape a well-formed doc reference page than a source code repo. I'd assume the same behavior extends to training data.
Oh, I'm sure there's all sorts of practical considerations regarding optimal LLM training.
All the same though, I don't like my host being so opinionated. I don't want a host that has something against any of the common search engines, and I don't want a host that has something against LLMs. Hosts should be as neutral as possible.
The code might not be theirs but the service hosting the code is and nothing is stopping you from hosting your code elsewhere. For some people blocking LLMs might be a reason to use sourcehut over github.
This would be fine in an ideal world. However, the one we live in has crawlers that don't care how many resources they use. They're fine with taking the server down or bankrupting the owner as long as they get the data they want.
And I can understand the abuse argument, however they have a blanket exclusion for AI I do not agree with.
>We have an Apache licensed project. You absolutely can use it for anything you want, including AI analysis. I don't appreciate third parties deciding on their own what can be done with code that isn't theirs.
That's not what the Apache license says.
According to the Apache license, derivative work needs to provide attribution and a copyright notice, which no major LLMs do.
I pay for sourcehut hosting, and I have no problems at all with this decision.
Unlike you with your GitHub-based project, I avoid Microsoft like the plague. I do not want to be complicit in supporting their monopoly power, their human rights abuses, and their environmental destruction.
You won't get visibility from AI.
I'm curious what your project is. Blockchain?
VR platform. We're actually opposed to any blockchain tech as an organization. We're into it for the users.
But I see no reason to have any issues with LLMs. ChatGPT/copilot/etc helping new people getting started? That sounds absolutely great to me.
[flagged]
We're effectively open source VR chat, if you want to get an idea of what it's like.
Crypto-focused projects of this kind include Decentraland. Which tend to devolve into things like selling virtual land. That's not our jam and doesn't align with the way the project works anyway -- you can have all the land you want, and set things up for free on your own server.
I don't have any more issues with ChatGPT than I have with Google and Kagi. All of those are closed source projects. But by all means, I love open source, so if an open source LLM can do something like writing code for our platform, that'd be wonderful.
I see, now it makes sense for me with that context
> Sorry, but I want all the visibility I can get.
I can understand that, but the various AI companies pounding sourcehut into the ground also results in zero visibility.
Anubis has had great results blocking LLM agents https://anubis.techaro.lol/
That's what sourcehut is using.
As an aside, I saw this for the first time on a kernel.org website a few days ago and actually thought it might have been hacked since I briefly saw something about kH/s (which screams cryptominer to me).
A screenshot for anyone who hasn't seen it: https://i.imgur.com/dHOmHtn.png
(this screen appears only very briefly, so while it is clear what it is from a static screenshot, it's very hard to tell in real time)
Yes, this is explained and linked in the first sentence of the linked article.
Looks like git diffs are the new gold for training LLMs : https://carper.ai/diff-models-a-new-way-to-edit-code/
The solution is to make Git fully self-contained and encrypted, just like Fossil[1] - store issues and PRs inside the repository itself, truly distributed system.
[1] https://fossil-scm.org/
> a racketeer like CloudFlare
Could anyone teach me what makes this a fair characterization of Cloudflare?
Not sure exactly what it is referring to, but my guess is that it's because Cloudflare sells LLM inference as a service, but also a service that blocks LLMs. A bit like an anti-DDoS company also selling DDoS services.
For example, https://developers.cloudflare.com/workers-ai/guides/demos-ar... has examples visit websites, then for the people on the other side (who want to protect themselves against those visits) there is https://developers.cloudflare.com/waf/detections/firewall-fo...
Just a guess though, I don't know the author's intentions/meaning for sure.
> A bit like an anti-DDoS company also selling DDoS services.
That's not far off from what Cloudflare does either, the majority of DDoS-for-hire outfits are hidden behind and protected by Cloudflare.
Whenever this comes up I do a quick survey of the top DDG results for "stresser" to see if anything's changed, and sure enough:
"Just a guess though, I don't know for sure the authors intentions/meanings."
I am reminded of this posting from years past:
https://news.ycombinator.com/item?id=38496499
"A lot of hacking groups, terror organizations and other malicious actors have been using cloud flare for a while without them doing shit about it. ... It's their business model. More DDoS means more cloudflare customers, yaaay."
I've not spent much time on this topic but I am very interested in the notion that a well-meaning third party established some kind of looking glass that surveyed cloudflare behavior and that third party was sued ?
I'd like to learn more about that situation ...
I remember that when the first influx of LLM crawlers hit Sourcehut, they had some talks with Cloudflare, which ended when CF demanded an outrageous amount of money from a company the size of Sourcehut. If I find the source for this, I'll update.
[edit] Here's the source: https://sourcehut.org/blog/2024-01-19-outage-post-mortem/#:~...
Cloudflare has been accused of playing both sides--they host services for known/associated DDoS providers while conveniently offering services to protect DDoS.
How about using this to contribute an absolutely tiny amount of hashes to a mining pool on behalf of the website owner, instead of just burning the energy?
> All features work without JavaScript
Maybe they should update their bullet points...
The footnote saying "fuck you now, maybe come back later" is really encouraging.
What I don't understand is why these scrapers so aggressively scrape websites which barely change. How much value is OpenAI etc getting from hammering the ever-living shit out of my website thousands of times a day when the content changes weekly at most? I truly don't understand the tactic. Surely their resources are better spent elsewhere?
How sure are we that they're actually LLM scrapers and not just someone trying to DDoS source hut with plausible deniability?
The LLM scrapers could publish the ip ranges they use for scraping like google does, but that would make it easier to block them so they probably wouldn't do that.
https://developers.google.com/search/docs/crawling-indexing/...
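If crawlers did publish their ranges, checking a request's source address against them would be straightforward. A hypothetical sketch using Python's `ipaddress` module; the CIDR blocks below are RFC 5737 documentation ranges standing in for a vendor's list, not anyone's real crawler ranges:

```python
import ipaddress

# Hypothetical published crawler ranges (illustrative only;
# these are reserved documentation networks, not real crawler IPs).
CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2
]

def is_known_crawler(addr: str) -> bool:
    """Return True if the address falls inside any published crawler range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in CRAWLER_RANGES)

print(is_known_crawler("192.0.2.15"))   # True
print(is_known_crawler("203.0.113.9"))  # False
```

That check is cheap enough to run per request at the edge, which is exactly why publishing ranges would make blocking trivial.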
[flagged]
Use of published information is still always constrained by copyright law. If I had a copyrighted movie playing on my television visible through the window, and you recorded that, redistributing that recording would unambiguously be a violation of copyright law and piracy.
I’m also a little confused by what you’re saying here; are you asking whether scraper bots are illegal, or whether they’re immoral/unethical?
Looking through your window is already covered by a lot of laws (it's legal sometimes in some places if you didn't take reasonable effort to prevent it [like closing the blinds], and as long as there was no trespass). If I captured a small enough section of a video - say, one frame in a photo - that likely is fair use. It is not crystal clear.
I’m getting to the ethical aspect but also trying to be pragmatic. “Publishing a bunch of information on the internet accessible without authentication” is an action that is fundamentally incompatible with controlling the use of that information.
The law cannot substitute for common sense; criminals are gonna crime.
Your drone makes a loud buzzing sound, and blocks the street for anyone else trying to get in, and does not move on its own. This is where it escalates from "Taking advantage of public information" to "Harassment".
Crawling source hut once is public information. Crawling it once a day using deltas might still be that. What these AI companies are doing is not that.
Pretending that published data isn’t public is a fool’s errand.
The point of a web host is to serve the users’ data to the public.
Anything else means the web host is broken.
I feel like we're part of a dying generation or something. I keep seeing people who want to post content to the public internet, but they still want to own the data somehow, and control who sees it, but still on the public internet.
I'm not sure how it's supposed to work, as I see the public internet as just that, a public square. What goes there can be picked up by anyone, for any purpose. If I want something to be secret, I don't put it on the public internet.
Gonna be interesting to see how that "public but in my control" movement continues to evolve, because it feels like they're climbing an impossible wall.
> I'm not sure how it's supposed to work
Laws.
This one is called copyright.
> Gonna be interesting to see how that "public but in my control" movement continues to evolve
Berne convention is almost 150 years old now...
Yes, laws work just great over the global public internet.
You have to actually utilize the legal system. Others suggest attaching licenses to their published content/data preventing AI training, but attaching the license alone does nothing. You have to actually drag someone to court once in a while for it to work.
It's the same complaint about the GDPR: if it works, why are sites still doing X/Y/Z... Well, because all people do is complain online; you need to report violations and be prepared to take legal action.
As someone who ran SMTP systems for years: legal complaints only work against people in your own country and the countries it cooperates with.
"Dear Russia, pwetty pweeese don't hammer my server to death stealing everything in sight"
[Crickets]
Russian IP ban
I get what you're saying, and I do question the value of suing someone in Russia or China, but did you actually get a lawyer to file an actual real lawsuit in Russia? Again, Russia: probably not really going to work.
There's absolutely no reason why you couldn't drag OpenAI to court... you'd need a ton of money, but you could and if you win, then the rest of the AI companies are going to get very busy adjusting their behaviour.
In the music biz, copyright is sometimes used to prevent very specific uses of material. Recent examples include a number of musicians (or perhaps their labels or publishing houses) denying the use of certain pieces of music at rallies for particular politicians.
IANAL, but my read of this is that if the content has the appropriate licence, the licence holder can withhold certain rights & access from certain groups of potential licensees. I'm loosely aware that the common open source licences are highly permissive, so probably they can't be used in this way... but presumably not all licences are like that. So, even though the work is "public", it should still be possible to enforce subsets of rights.
And to take your "public square" analogy... before we had cameras everywhere, there was some expectation of "casual" privacy even in public spaces. Not everyone in the square hears everything said by everyone else in the square. The fact that digital tools make privacy breaches much easier doesn't mean it should be tolerated.
(that said, I'm fairly careful what I publish online)
https://en.wikipedia.org/wiki/Tragedy_of_the_commons ; people could be nicer about their use of public spaces / resources.
Though that also goes towards those posting content. Systems that generate an infinite number of permutations for viewing the same information are a poor design. It can easily lead to even conscientious people discovering that a simple attempt to slowly mirror a website overnight with wget has resulted in some rabbit hole of forum view parameter explosion.
The public internet is a public square, but does it have to be some caricature 1970's Times Square dominated by dealers, pimps, and thugs all looking to extract whatever they can from me?
I'm one of those people. I think it comes down to the intent. There is an implicit good will toward those who do this that the data isn't abused or the infrastructure behind it overwhelmed (self-hosting). "Big Tech" just makes this worse, because their motivations aren't the same as ours (small web).
Once all devices are locked down and all control has been taken away from the users, we can finally have functional DRM on every device and make this dream come true.
> I feel like we're part of a dying generation or something
Well... yes. Just like every living thing that's ever existed ;)
> public but in my control
I think we have a word for that - copyright
But even copyright doesn't give you "control" over something. I can still download it and use it however I want, privately; I'm just legally limited in further distributing it. Unless I change it sufficiently, then it's again OK to distribute my changed version. Of course, depending on country and whatnot.
Problem remains the same: as soon as the content is visible on the public internet, you lose 100% control of its propagation, illegal or not.
> I can still download it and use it however I want, privately, just that I'm limited legally to further distribute it
the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
it’s just moving the ‘control’ up a level to a different party. it’s still there.
> as soon as the content is visible on the public internet, you lose 100% control of it's propagation, illegal or not.
GPL license: thou shall always* provide access to this software’s source code —> form of control in the opposite direction to music copyright, but still a form of control
* subject to terms, conditions, locale and other legal things (IANAL)
> the party that allowed you to download it committed copyright infringement by distributing it. the same restrictions apply to them as to you*. a copyright holder may well want that party to stop distributing it, which means no-one else can download it.
That's why every social media site in existence puts terms in its EULA demanding that users grant the site a blanket license to redistribute their content, over and above any separate licenses they may put on it. After it's been redistributed to third parties, the copyright holder has no more control (at least, not via copyright law) over how those copies are privately used.
E.g., on HN: "With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, 'User Content'), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein. By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed."
Have to agree. If you want to limit who sees your content, put it behind a paywall or some sort of subscription.
For Anubis it makes sense.
If you post something on a physical bulletin board, you expect people will come by and read it
If a bunch of "scrapers" come by and form a massive mob so that nobody else can read it, and then cover the board with slightly different copies attributed to themselves, that isn't exactly the "public square" you imagined
> If you post something on a physical bulletin board, you expect people will come by and read it
Or birds, or public cameras, or people who take photos next to it, or people archiving bulletin boards, or...
If I put stuff in public spaces, I expect anyone and anything to be able to read, access and store it. Basically how I treat this very comment, and everything else I put on the public internet.
Eh, the internet is a weirdly semi-public space. Think of it more like a mall parking lot than a public common. If you put something up for notice there it will probably be fine. For example a number of grocery stores around me have boards for things just like that. But the moment it becomes a public nuisance for them they will trespass your ass outta there faster than a starved dog would eat a dropped hotdog.
As you say, it's not public as in public road (and even that has police to enforce proper behavior), but more like publicly accessible but on private properties.
When it's my server and I'm paying for it, then banning resource wasters is the right move.
You can’t have your cake and eat it, too. The power of the internet and mass media is that you can publish to the whole world, and make something public to billions of people. With that power comes side effects, such as, obviously, billions of people being able to privately do whatever they want with the information you, you know, published.
“public, but, no, not like that” isn’t a thing and no technological measure can make it a thing.
I don't know, this is like saying that because those bowls of mints at a restaurant are "free", I can back a trailer up to the door and start loading it up. Even if you know they'll never run out of mints.
I feel like you need to present a very strong case where LLMs are "the public" before you take such a weak position when interpreting the entirety of the article.
Drew makes it perfectly clear in TFA that "the public", as he sees it, is fully entitled and should make use of the data SourceHut provides.
LLMs are just tools, run by human beings who are naturally members of the public. There is no confusion or ambiguity here.
So are DDoS scripts
But one can argue that the LLM crawlers deny the rest of the public access to your data by consuming all available bandwidth.
If the AI scrapers respected the robots.txt file then this wouldn't be an issue. A company is allowed to set the terms of service for their service and take action if other companies are abusing that.
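For reference, the opt-out itself is technically trivial; the whole problem is that compliance is voluntary. A sketch of the kind of rules involved — the user-agent tokens below are ones various AI companies have documented for their crawlers, but any such list is illustrative and goes stale quickly:

```
# robots.txt: disallow some documented AI-training crawlers
# (illustrative, not exhaustive)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else: crawl normally
User-agent: *
Allow: /
```

A well-behaved crawler fetches this file and skips the site entirely; the abusive ones either never request it or ignore it, which is what pushes hosts toward active countermeasures.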
Opening a bakery and feeding the entire world aren't the same.
Then put your content behind a paywall. Bakeries typically aren't free.