robots.txt's main purpose back in the day was curtailing search-engine penalties when you got stuck maintaining a badly built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs; ignore all the other ones with query parameters or whatever that give almost the same result."
It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.
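For a concrete sense of it, a robots.txt from that era might have looked something like this (the paths are made up for illustration; the original 1994 convention only supported simple prefix matching on paths, no wildcards, so the trick was to corral the query-driven or duplicate views under a few prefixes you could disallow):

    # Hypothetical example: steer crawlers away from parameterized/duplicate views
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /search
    Disallow: /print/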
Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.
That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."
While we're at it, it should be noted that Do Not Track was not, apparently, a joke.
It's the same as a noreply email: if you can get away with sticking your fingers in your ears and humming when someone is telling you something you don't want to hear, and you have a computer to hide behind, then it's all good.
It is ridiculous, but it is what you get when you have conflicting interests and broken legislation. The rule is that tracking has to be opt-in, so websites do it in the way that makes people most likely to opt in, and that is a cookie banner in front of the content.
Do-not-track is opt-out, not opt-in, and in fact, it is not opt-anything since browsers started to set it to "1" by default without asking. There is no law forcing advertisers to honor that.
I guess it could work the other way: if you set do-not-track to 0 (meaning "do-track"), which no browser does by default, make cookies auto-accept and do not show the banner. But then the law says that it should require no more actions to refuse consent than to consent (to counter those ridiculous "accept or uncheck 100 boxes" popups), so it would mean they would also have to honor do-not-track=1, which they don't want to.
I don't know how legislation could be unbroken. Users don't want ads, don't want tracking, they just want the service they ask for and don't want to pay for it. Service providers want exactly the opposite. Also people need services and services need users. There is no solution that will satisfy everyone.
> since browsers started to set it to "1" by default without asking
IIRC IE10 did that, to much outcry, because it upended the whole premise of DNT being an explicit choice; no other browser (including Edge) set it by default.
There have been thoughts about using DNT (the technical mechanism for communicating consent/objection) in conjunction with the GDPR (the legal framework for enforcing consent/objection compliance).
The GDPR explicitly mentions objection via technical means:
> In the context of the use of information society services, and notwithstanding Directive 2002/58/EC, the data subject may exercise his or her right to object by automated means using technical specifications.
I myself consider DNT as what it means at face value: I do not want to be tracked, by anyone, ever. I don't know what's "confusing" about that.
The only ones that are "confused" are the ones it would be detrimental to, i.e. the ones that perform and extract value from the tracking, and who make people run in circles with contrived explanations.
It would be perfectly trivial for a browser to pop up a permission request per website like there is for webcams or microphone or notifications, and show no popup should I elect to blanket deny through global setting.
Labor laws are not set to satisfy everyone; they are set such that a company cannot use its outsized power to exploit its workers, and so that workers have a fair chance at negotiating a fair deal despite holding less power.
Similarly, consumer protection laws (which the cookie banners are) are not set to satisfy everyone; they are set such that companies cannot use their outsized power to exploit their customers. A good consumer protection law will simply ban harmful behavior regardless of whether the companies that engage in said harmful behavior are satisfied with that ban or not. A good consumer protection law will satisfy the user (or rather the general public), but it may not satisfy the companies.
Good consumer protection laws are things like disclosure requirements or anti-tying rules that address information asymmetries or enable rather than restrict customer choice.
Bad consumer protection laws try to pretend that trade offs don't exist. You don't want to see ads, that's fine, but now you either need to self-host that thing or pay someone else money to do it because they're no longer getting money from ads.
There is no point in having an opt in for tracking. If the user can be deprived of something for not opting in (i.e. you can't use the service) then it's useless, and if they can't then the number of people who would purposely opt in is entirely negligible and you ought to stop beating around the bush and do a tracking ban. But don't pretend that's not going to mean less "free stuff".
The problem is legislators are self-serving. They want to be seen doing something without actually forcing the trade off that would annihilate all of these companies, so instead they implement something compromised to claim they've done something even though they haven't actually done any good. Hence obnoxious cookie banners.
That whole argument assumes that you as a consumer can always find a product with exactly the features you want. Because that's a laughable fiction, there need to be laws with teeth to punish bad behaviors that nearly every product would indulge in otherwise. That means things like requiring sites to get permission to track, and punishing those that track users without permission. It's a good policy in theory, but it needs to be paired with good enforcement, and that's where things are currently lacking.
> That whole argument assumes that you as a consumer can always find a product with exactly the features you want. Because that's a laughable fiction
There are very many industries where this is exactly what happens. If you want a stack of lumber or a bag of oranges, it's a fungible commodity and there is no seller who can prevent you from buying the same thing from someone else if you don't like their terms.
If this is ever not the case, the thing you should be addressing is that, instead of trying to coerce an oligopoly that shouldn't exist into behaving under the threat of government penalties rather than competitive pressure. Because an uncompetitive market can screw you in ten thousand different ways regardless of whether you've made a dozen of them illegal.
> That means things like requiring sites to get permission to track, and punishing those that track users without permission. It's a good policy in theory, but it needs to be paired with good enforcement, and that's where things are currently lacking.
It's not a good policy in theory because the theory is ridiculous. If you have to consent to being tracked in exchange for nothing, nobody is going to do that. If you want a ban on tracking then call it what it is instead of trying to pretend that it isn't a ban on the "free services in exchange for tracking data" business model.
I think you might be misunderstanding the purpose of consumer protection. It is not about consumer choice; rather, it is about protecting consumers from the inherent power imbalance that exists between a company and its customers. If there is no way of providing a service for free without harming the customers, that service should be regulated such that no vendor is able to provide it for free. It may seem punishing for the customers, but it is not. It protects the general public from this harmful behavior.
I actually agree with you that cookie banners are a bad policy, but for a different reason. As I understand it there are already requirements that the same service should also be available to opt-out users, however as your parent noted, enforcement is an issue. I, however, think that tracking users is extremely consumer hostile, and I think a much better policy would be a simple ban on targeted advertising.
> I think you might be misunderstanding the purpose of consumer protection. It is not about consumer choice; rather, it is about protecting consumers from the inherent power imbalance that exists between a company and its customers.
There isn't an inherent power imbalance that exists between the company and their customers, when there is consumer choice. Which is why regulations that restrict rather than expand consumer choice are ill-conceived.
> If there is no way of providing a service for free without harming the customers, that service should be regulated such that no vendor is able to provide it for free.
But that isn't what those regulations do, because legislators want to pretend to do something while not actually forcing the trade off inherent in really doing the thing they're only pretending to do.
> I, however, think that tracking users is extremely consumer hostile, and I think a much better policy would be a simple ban on targeted advertising.
Which is a misunderstanding of the problem.
What's actually happening in these markets is that we a) have laws that create a strong network effect (e.g. adversarial interoperability is constrained rather than required), which means that b) the largest networks win, and the networks available for free then become the largest.
Which in turn means you don't have a choice, because Facebook is tracking everyone but everybody else is using Facebook, which means you're stuck using Facebook.
If you ban the tracking while leaving Facebook as the incumbent, two things happen. First, those laws are extremely difficult to enforce because neither you nor the government can easily tell what they do with the information they inherently get from the use of a centralized service, so they aren't effective. And second, they come up with some other business model -- which will still be abusive because they still have market power from the network effect -- and then get to blame the new cash extraction scheme on the law.
Whereas if you do what you ought to do and facilitate adversarial interoperability, that still sinks their business model, because then people are accessing everything via user agents that block tracking and ads, but it does it while also breaking their network effect by opening up the networks so they can't use their market power to swap in some new abusive business model.
For one, Do Not Track is on the client side and you just hope and pray that the server honors it, whereas cookie consent modals are something built by and placed in the website.
I think you can reasonably assume that if a website went through the trouble of making such a modal (for legal compliance reasons), the functionality works (also for legal compliance reasons). And, you as the client can verify whether it works, and can choose not to store them regardless.
I would assume most websites would still set cookies even if you reject consent, because the consent is only about cookies that aren't technically necessary. Just because the website sets cookies doesn't tell you whether it respects your selection. Only if it doesn't set any cookies can you be sure, and I would assume that's a small minority of websites.
The goal with Do Not Track was legal (get governments to recognize it as the user declining consent for tracking and forbidding additional pop-ups) and not technological.
Unfortunately, the legal part of it failed, even in the EU.
So it did the same work that a sitemap does? Interesting.
Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.
> I didn’t realize its original purpose was to manage duplicate content penalties though.
That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the RFC from 1994 says:
> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
> It was mostly about stopping crawlers from unnecessarily consuming server resources.
Very much so.
Computation was still expensive, and http servers were bad at running cgi scripts (particularly compared to the streamlined amazing things they can be today).
SEO considerations came way way later.
They were also used, and still are, by sites that have good reasons to not want results in search engines. Lots of court files and transcripts, for instance, are hidden behind robots.txt.
I think this is still relevant today in cases where there are not many resources available: think free tiers, smallest fixed cost/fixed allocation scenarios, etc.
The scenario I remember was that the underfunded math department had an underpowered server connected via a wide and short pipe to the overfunded CS department and webcrawler experiments would crash the math department's web site repeatedly.
What everybody is missing is that AI inference (not training) is a route out of the enshittification economy. One reason why Cloudflare is harassing you all the time to click on traffic lights and motorcycles is to slam the door from some of the exit routes.
It is so interesting to track this technology's origin back to the source. It makes sense that it would come from a background of limited resources where things would break if you overwhelm it. It didn't take much to do so.
I always consider "good" a bot that doesn't disguise itself and follows the robots.txt rules. I may not consider good the final intent of the bot or the company behind it, but the crawler behaviour is fundamentally good.
Especially considering the fact that it is super easy to disguise a crawler and not follow the robots.txt conventions.
Well, you as the person running a website can unilaterally define what you consider good and bad. You may want bots to crawl everything, nothing, or (most likely) something in between. Then you judge bots based on those guidelines. You know, like a solicitor who rings a doorbell with a sign above it saying "No solicitors": certain assumptions can be made about those who ignore it.
I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.
But I kinda like having this attitude and expectation. Makes me feel healthier.
> Trust by default, also by default, never ignoring suspicious signals.
While I absolutely love the intent of this idea, it quickly falls apart when you're dealing with systems where you only get the signals after you've already lost everything of value.
I like that Veritasium vid a lot; I've watched it a couple times. The thing is, there's no way to retaliate against a crawler ignoring robots.txt. IP bans don't work, user-agent bans don't work, and there's no human to shame on social media either. If there's no way to retaliate or provide some kind of meaningful negative feedback, then the whole thing breaks down. Back to the Veritasium video: if a crawler defects, it reaps the reward, but there's no way for the content provider to defect, so the crawler defects 100% of the time and gets 100% of the defection points. I can't remember when I first read the RFC for robots.txt, but I do remember finding it strange that it was a "pretty please" request aimed at crawlers that have a financial incentive to crawl as much as they can. Why even go through the effort to type it out?
EDIT: I thought about it for a minute. I think in the olden days a crawler crawling every path through a website could yield an inferior search index, so robots.txt gave search engines a hint about what content was valuable to index. The content provider gained because their SEO was better (and CPU utilization lower), and the search engine gained because their index was better. So there was an advantage to cooperation then, but with crawlers feeding LLMs that isn't the case.
This is a really cool tool. I haven't seen it before. Thank you for sharing it!
On their README.md they state:
> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.
It's easy to believe, though, and most of us do it every day. For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.
There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).
In Southern Europe, and in countries with similar cultures, we don't obey rules because someone says so; we obey them when we see that it is actually reasonable to do so. Hence my remark regarding culture, as I have also experienced living in countries where everyone mostly blindly follows the rules, even when they happen to be nonsense.
Naturally I am talking about cultures where that decision has not been taken away from their citizens.
> I have also experienced living in countries where everyone mostly blindly follows the rules, even when they happen to be nonsense.
The problem with that is that most people are not educated enough to judge what makes sense and what doesn’t, and the less educated you are, the more likely you are to believe you know what makes sense when you’re actually wrong. These are exactly the people that should be following the rules blindly, until they actually put in the effort to learn why those rules exist.
I believe there is a difference between education and critical thinking. One may not have a certain level of education, but could exercise a great degree of critical thinking. I think that education can help you understand the context of the problem better. But there are also plenty of people who are not asking the right questions or not asking questions - period - who have lots of education behind them. Ironically, sometimes education is the path that leads to blind trust and lack of challenging the status quo.
> the less educated you are, the more likely you are to believe you know what makes sense
It actually frightens me how true this statement is.
To reinforce my initial position about how important the rules are for setting expectations, I usually use cyclists as an example. Many follow the proposed rules, understanding they are traffic, and right of way is not automagically granted based on the choice of vehicle, having more to do with direction and the flow of said traffic.
But there's always a bad apple, a cyclist who assumes themselves to be exempt from the rules and rides against the flow of traffic, then wonders why they got clipped because a right-turning driver wasn't expecting a vehicle to be coming from the direction traffic is not supposed to come from.
In the end, it's not really about what we drive or how we get around, but whether we are self-aware enough to understand that the rules apply to us, and collectively so. Setting the expectation of what each of our behaviors will be is precisely what creates the safety that comes with following them, and only the dummies seem to be the ones who think they are exempt.
As a French person, being passed on the right by Italian drivers on the highway really makes me feel the superiority of Southern European judgment over my puny habit of blindly following rules. Or does it?
But yes, I do the same. I just do not come here to pretend this is virtue.
The rules in France are probably different, but passing on the right is legal on Italian highways in one circumstance: if you keep driving in the right lane and somebody slower happens to be driving in the left lane. The rationale is that this normally happens when traffic is packed, so it's OK even if there is little traffic. Everybody keeps driving straight and there is no danger.
It's not legal if somebody is following the slower car in the left lane and steers to the right to pass. However, some drivers stick to the left at a speed below the limit, and if they don't yield, what eventually happens is that they get passed on the right.
The two cases have different names. The normal pass is "sorpasso", the other one (passing by not steering) is "superamento", which is odd but they had to find a word for it.
Not sure if it is a virtue, but standing as a pedestrian in an empty street at 3 AM waiting for a traffic light to turn green doesn't make much sense either; it isn't as if a ghost car is coming out of nowhere.
It should be a matter of judgement and not following rules just because.
I kind of agree. The rules for safety should be simple, straightforward, and protect you in the "edge cases", i.e. following while not paying 100% of attention, protect you with a malicious actor in mind aka reckless driver, etc. Ideally, in a system like that it should be a difficult and intentional behavior if one wanted to break the rules rather than to follow them.
I agree. I mostly mean that it is good to strive towards a system of rules that will be easy to follow and difficult to break by default. That is an ideal case. In reality, it is never that simple.
> For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.
That trust comes from the knowledge that it's likely that those drivers also don't want to crash, and would rather prefer to get where they're going.
I apologize for that. I try to mitigate my US-centricness in my comments as much as possible, understanding completely that I am speaking with a global audience, but I am definitely not perfect at it :D
I suppose the same goes if you take the tube, ride a bike, walk, etc? There's still rules in terms of behavior, flow of traffic (even foot traffic), etc, that helps set a number of expectations so everyone can decide and behave accordingly. Happy to hear different thoughts on this!
I still see the value in robots.txt and DNT as a clear, standardised way of posting a "don't do this" sign that companies could be forced to respect through legal means.
The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.
Would robot traffic be considered tracking in light of GDPR standards? As far as I know there are no regulatory rules in relation to enforcing robots behaviors outside of robots.txt, which is more of an honor system.
DNT and GDPR was just an example. In a court case about tracking, DNT could be found to be a clear and explicit opt-out. Similarly, in a case about excessive scraping or the use of scraped information, robots.txt could be used as a clear and explicit signal that the site operator does not want their pages harvested. It all but certainly gets rid of the "they put it on the public web so we assumed we could scrape it; we can't ask everyone for permission" argument. They can't claim it was "in good faith" if there's a widely accepted standard for opting out.
> That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.
The idea behind the DNT header was to back it up with legislation -- and sure, you can't catch and prosecute all tracking, but there are limits to how far criminal move-fast-and-break-things can scale before someone rats you out. :P
I created a search engine that crawled the web way back in 2003. I used a proper user agent that included my email address. I got SO many angry emails about my crawler, which played as nice as I was able to make it play. Which was pretty nice I believe. If it’s not Google people didn’t want it. That’s a good way to prevent anyone from ever competing with Google. It isn’t just about that preview for LinkedIn, it’s about making sure the web is accessible by everyone and everything that is trying to make its way. Sure, block the malicious ones. But don’t just assume that every bot is malicious by default.
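"Playing nice" mostly came down to a couple of simple things, roughly like this (a sketch, not my 2003 code; the crawler name and contact address are placeholders):

    import time
    import requests

    # Identify yourself, include a contact address, and throttle requests
    # so you never hammer any one host.
    HEADERS = {
        "User-Agent": "ExampleCrawler/1.0 (+https://example.com/bot; crawler-admin@example.com)"
    }

    def polite_fetch(urls, delay_seconds=5.0):
        """Fetch URLs one at a time with a fixed delay between requests."""
        results = {}
        for url in urls:
            resp = requests.get(url, headers=HEADERS, timeout=30)
            results[url] = resp.status_code
            time.sleep(delay_seconds)  # crude; a real crawler tracks per-host delays
        return results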
I definitely agree here. My initial response was to block everything, however you realize that web is complex and interdependent. I still believe that everyone should have autonomy over their online resources if they desire. But that comes with an intentionality behind it. If you want to allow or disallow certain traffic, you also should answer the question why or why not. That requires understanding what each bot does. That takes time and effort.
My foray into robots.txt started from the whole notion of AI companies training on everything they can put their hands on. I want to be able to have a say whether I allow it or not. While not all bots will honor the robots.txt file, there are plenty that do. One way that I found you can test that is by asking the model directly to scrape a particular link (assuming the model has browsing capabilities).
Bots are not malicious by default. It is what that company does with your data and how you feel about it that matters in the end.
The most annoying thing about being a good bot owner, in my experience, is when you get complaints about it misbehaving, only to find that it was actually somebody malicious who wrote their own abusive bot, but is using your bot's user agent.
Cloudflare have some new bot verification proposals designed to fix this, with cryptographic proofs that the user-agent is who they say they are: https://blog.cloudflare.com/web-bot-auth/.
That's easy to say when it's your bot, but I've been on the other side enough to know that the problem isn't your bot; it's the 9000 other ones just like it, none of which will deliver traffic anywhere close to the resources consumed by scraping.
True. Major search engines and bots from social networks have a clear value proposition: in exchange for consuming my resources, they help drive human traffic to my site. GPTBot et al. will probably do the same, as more people use AI to replace search.
A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.
An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.
> GPTBot et al. will probably do the same, as more people use AI to replace search.
It really won’t. It will steal your website’s content and regurgitate it back out in a mangled form to any lazy prompt that gets prodded into it. GPT bots are a perfect example of the parasites you speak of that have destroyed any possibility of an open web.
That was my hunch. My initial post on robots.txt: https://evgeniipendragon.com/posts/i-am-disallowing-all-craw... - revolved around blocking AI models from doing that because I do not believe that it will bring more traffic to my website - it will use the content to keep people using their service. I might be proven wrong in the future, but I do not see why they would want to let go of an extra opportunity to increase retention.
Which is all you need a lot of the time. If you're a hotel, or restaurant, or selling a product, or have a blog to share information important to you, then all you need is for the LLM to share it with the user.
"Yes, there are a lot of great restaurants in Chicago that cater to vegans and people who enjoy musical theater. Dancing Dandelions in River North is one." or "One way to handle dogs defecating in your lawn is with Poop-Be-Gone, a non-toxic product that dissolves the poop."
It's not great for people who sell text online (journalists, I guess, who else?). But that's probably not the majority of content.
You are bringing a great point. In some cases having your data as available as possible is the best thing you can do for your business. Letting them crawl and scrape creates means by which your product is found and advertised.
In other cases, like technical writing, you might want to protect the data. There is a danger that your content will be stolen and nothing will be given in return - traffic, money, references, etc.
Only if the GPT companies can resist the temptation of all that advertising $$$.
I'll give them at most 3 years before sponsored links begin appearing in the output and "AI optimization" becomes a fashionable service alongside the SEO snake oil. Most publishers won't care whether their content is mangled or not, as long as it is regurgitated with the right keywords and links.
All our customers were in North America, but we let a semi-naughty bot from the UK scan us, and I will never understand why. It was still sending us malformed URLs we had purged from the site years ago. WTF.
> but that ideal only holds between consenting adults.
If your webserver serves up the page, you've already pre-consented.
One of my retirement plans has a monthly statement available as a pdf document. We're allowed to download that. But the bot I wrote to download it once a month was having trouble, they used some fancy bot detection library to cockblock it. Wasn't allowed to use Mechanize. Why? Who the fuck knows. I'm only allowed to have that statement if I can be bothered to spend 15 minutes a month remembering how to fucking find it on their site and downloading it manually, rather than just saving a copy. Banks are even worse... they won't keep a copy of your statements longer than 6 months, but go apeshit if you try to have those automatically downloaded.
I don't ask permission or play nice anymore. Your robots.txt is ignorable, so I ignore it. I do what I want, and you're the problem not me.
With 1000s of bots per month and 10,000 hits on an ecommerce site, with product images, that's a lot of data transfer, and a lot of compute if your site has badly designed or no caching, rendering all the same page components millions of times over. But...
Part of the problem is all those companies who use AWS "standard practice" services, who assume the cost of bandwidth is just what AWS charges, and compute-per-page is just what it is, and don't even optimise those (e.g. S3/EC2/Lambda instead of CloudFront).
I've just compared AWS egress charge against the best I can trivially get at Hetzner (small cloud VMs for bulk serving https cache).
You get an astonishing 392x(!) more HTTPS egress from Hetzner for the same price, or equivalently 392x cheaper for the same amount.
You can comfortably serve 100+ TB/month that way. With 10,000 pages times 1000 bots per month, that gives you 10 MB per page, which is more than almost any eCommerce site uses, once you factor in that bots (other than very badly coded bots) won't fetch the common resources (JS etc.) repeatedly for each page, only the unique elements (e.g. HTML and per-product images).
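Spelling that arithmetic out (the figures are rough assumptions, not a price quote):

    # Back-of-the-envelope check of the 10 MB/page figure
    pages = 10_000
    bots_per_month = 1_000
    egress_budget_tb = 100                         # ~100 TB/month served from small cloud VMs

    requests_per_month = pages * bots_per_month    # 10,000,000 page fetches
    budget_bytes = egress_budget_tb * 10**12       # decimal terabytes
    mb_per_page = budget_bytes / requests_per_month / 10**6
    print(mb_per_page)                             # -> 10.0 MB per page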
Yeah, there were times, even running a fairly busy site, that the bots would outnumber user traffic 10:1 or more, and the bots loved to endlessly troll through things like archive indexes that could be computationally (db) expensive. At one point it got so bad that I got permission to just blackhole all of .cn and .ru, since of course none of those bots even thought of half obeying robots.txt. That literally cut CPU load on the database server by more than half.
In the last month, bot traffic has exploded to 10:1 due to LLM bots on my forum according to Cloudflare.
It would be one thing if it were driving more users to my forum. But human usage hasn't changed much, and the bots drop cache hit rate from 70% to 4% because they go so deep into old forum content.
I'd be curious to see a breakdown of what the bots are doing. On demand searches? General data scraping? I ended up blocking them with CF's Bot Blocker toggle, but I'd allow them if it were doing something beneficial for me.
For me (as I'm sure for plenty other people as well) limiting traffic to actual users matters a lot because I'm using a free tier for hosting in the time being. Bots could quickly exhaust it, and your website could be unavailable for the rest of the current "free billing" cycle, i.e. until your quota gets renewed.
This is how it's going. Half the websites I go to have Cloudflare captchas guarding them at this point. Every time I visit StackOverflow I get a 5 second wait while Cloudflare decides I'm kosher.
Are you using TOR or a VPN, spoofing your User-Agent to something uncommon, or doing something else that tries to add extra privacy?
That kind of user experience is one that I've seen a lot on HN, and every time, without fail, it's because they're doing something that makes them look like a bot, and then being all Surprised Pikachu when they get treated like a bot by websites.
I started having similar experiences when I switched to the Brave browser, which blocks a lot of tracking. Many websites that never used to challenge me now show those captchas and Cloudflare protection layers on a regular basis.
It sucks more that Cloudflare/similar have responded to this with "if your handshake fingerprints more like curl than like Chrome/Firefox, no access for you".
There are tools like curl-impersonate: https://github.com/lwthiker/curl-impersonate out there that allow you to pretend to be any browser you like. Might take a bit of trial and error, but this mechanism could be bypassed with some persistence in identifying what is it that the resource is trying to block.
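Usage is roughly like this (the wrapper script names differ between releases, so treat these as illustrative):

    # Fetch with a TLS/HTTP2 handshake that mimics a real Chrome build
    ./curl_chrome110 https://example.com/

    # Compare with plain curl, whose handshake is what fingerprinting rules key on
    curl https://example.com/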
Or getting a CAPTCHA from Chrome when visiting a site you've been to dozens of times (Stack Overflow). Now I just skip that content, probably in my LLM already anyway.
It's the same thing as the anti-piracy ads: you only annoy legitimate customers. This aggressive captcha campaign just makes Stack Overflow decline even faster than it would otherwise by making it lower quality.
I now write all of my bots in javascript and run them from the Chrome console with CORS turned off. It seems to defeat even Google's anti-bot stuff. Of course, I need to restart Chrome every few hours because of memory leaks, but it wasn't a fun 3 days the last time I got banned from their ecosystem with my kids asking why they couldn't watch Youtube.
I guess back in 2003 people would expect an email address to actually go somewhere; these days I would expect it to either go nowhere or just be part of a campaign to collect server admin emails for marketing/phishing purposes. Angry emails are always a bit much, but I wonder if they just aren't sent as much anymore in general, or if people have stopped posting them to point and laugh at and wonder what goes through someone's mind to get upset enough to send such emails.
My somewhat silly take on seeing a bunch of information like emails in a user agent string is that I don't want to know about your stupid bot. Just crawl my site with a normal user agent and if there's a problem I'll block you based on that problem. It's usually not a permanent block, and it's also usually setup with something like fail2ban so it's not usually an instant request drop. If you want to identify yourself as a bot, fine, but take a hint from googlebot and keep the user agent short with just your identifier and an optional short URL. Lots of bots respect this convention.
But I'm just now reminded of some "Palo Alto Networks" company that started dumping their garbage junk in my logs, they have the audacity to include messages in the user agent like "If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com" or "find out more about our scans in [link]". I put a rule in fail2ban to see if they'd take a hint (how about your dumb bot detects that it's blocked and stops/slows on its own accord?) but I forgot about it until now, seems they're still active. We'll see if they stop after being served nothing but zipbombs for a while before I just drop every request with that UA. It's not that I mind the scans, I'd just prefer to not even know they exist.
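For anyone who just wants them out of their logs, a UA-substring drop at the web server is the lazy version (nginx shown; matching on User-Agent is trivially spoofable, as noted elsewhere in this thread):

    # Inside a server {} block: refuse anything whose User-Agent mentions the scanner.
    # 444 is nginx-specific: close the connection without sending a response.
    if ($http_user_agent ~* "paloaltonetworks") {
        return 444;
    }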
I think a better solution would be to block all the traffic, but have a comment in robots.txt with a way to be added onto a whitelist to scrape the contents of the resource. This puts a burden of requesting the access on the owner of the bot, and if they really want that access, they can communicate it and we can work it out.
It's a nice option to have and maybe good in some cases. It reminds me of the nicety that some journalists do when requesting if they can use some video uploaded on social media for their show or piece. I do like the approach and shifting of first contact burden, as well as the general philosophical principle that blocking ought to be reversible and also temporary rather than permanent (though I also like the idea of exponential timeouts that can become effectively permanent). Still, I don't see myself ever doing anything like that. I'd still prefer to just not know about the bot at all, and if I did decide to perma-block them, unless the first contact comes with sufficient dollar signs attached I'm likely to ignore it entirely. I'm not usually in the mood for starting random negotiation with anybody.
I also tend to see the web from the "open web" dream perspective. By default no traffic is blocked. The burden of requesting is already inherently done with a client -- they request a route, and I serve it or not. For things like my blog I don't tend to care who is requesting a particular route -- even admin pages can be requested, they just don't get anything without being logged in. If someone is being "cute" requesting non-existent wordpress pages or what have you, searching for vulnerabilities, or have an annoying/ugly user agent string, or are just pounding me for no real reason, then I do start to care. (The "pounding" aspect is a bit trickier -- I look at steady state. Another comment mentioned cutting their db server's cpu load in half by dropping unlikely-to-be-real-users from two countries. For me, if that is merely a steady state reduction from like 10% of a machine to 5%, I don't really care, I start caring when it would get in the way of real growth without having to use more resources.)
When I was hosting on EC2, I used to have very mild anxiety that I'd piss off someone and they'd try to "harm" me by launching a botnet of requests at large media files and rack up bandwidth costs. (I believe it when some people say this has happened more organically with normal bots in the age of LLMs, but my concern was more targeted botnets/DDoS.) There are a few ways to mitigate that anxiety:
1) Set up monitoring, alerts, and triggers directly in code running on the instance itself, or via overseeing AWS tools. (I did the latter, which is less reliable, but still. There was a threshold to shut down the whole instance, minimizing the total possible damage to something like under a couple hundred bucks; I forget the details of trying to calculate how much traffic could theoretically be served before the monitoring side noticed.)
2) Hide behind Cloudflare and their unlimited bandwidth, as my content was mostly static. (I didn't do that.)
3) Move/rearchitect to a free host like GitHub Pages and give up hosting my own comments. (Again, didn't do.)
4) Move to OVH, which has unlimited bandwidth. (I did this when Amazon wanted to start charging an absurd amount for just a single IPv4 address.)
I can see how it could lead to more overhead when communicating with the requesters. That could be a lot in the event that lots of them might want to crawl your resource.
I can see the argument that if I want to hide something, I should put it behind the layer of authentication. Robots is not a substitution for proper access control mechanisms. It is more of a "if they do honor this document, this would reduce the unnecessary traffic to my site" notion.
I appreciate you highlighting your personal experience in dealing with bots! I like the ideas of monitoring and being behind something like Cloudflare tools which would protect against the major influx of traffic. I think this is especially important for smaller sites which either use low or free tiers of cloud services.
It's just that people are suspicious of unknown crawlers, and rightly so.
Since it is impossible to know a priori which crawlers are malicious, and many are malicious, it is reasonable to default to considering anything unknown malicious.
The problem with robots.txt is its reliance on the identity rather than the purpose of the bots.
The author had blocked all bots because they wanted to get rid of AI scrapers. Then they wanted to unblock bots scraping for OpenGraph embeds, so they unblocked... LinkedIn specifically. What if I post a link to their post on Twitter or on any of the many Mastodon instances? Now they'd have to manually unblock all of those UAs, which they obviously won't, so this creates an even bigger power advantage for the big companies.
What we need is an ability to block "AI training" but allow "search indexing, opengraph, archival".
And of course, we'd need a legal framework to actually enforce this, but that's an entirely different can of worms.
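Until something like that exists, the closest workaround is still identity-based: enumerate the training crawlers that publish a user-agent token and hope they honor it. Roughly (these tokens are ones the vendors document, but the list goes stale quickly):

    # Opt out of documented AI-training crawlers...
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # ...while leaving ordinary indexing and preview fetchers alone
    User-agent: *
    Disallow: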
I think there is a long standing question about what robots.txt is for in general. In my opinion it was originally (and still is mostly) intended for crawlers. It is for bots that are discovering links and following them. A search engine would be an obvious example of a crawler. These are links that even if discovered shouldn't be crawled.
On the other end is user-requested URLs. Obviously a browser operated by a human shouldn't consider robots.txt. Almost as obviously, a tool subscribing to a specific iCal calendar feed shouldn't follow robots.txt because the human told it to access that URL. (I recall some service, can't remember if it was Google Calendar or Yahoo Pipes or something else that wouldn't let you subscribe to calendars blocked by robots.txt which seemed very wrong.)
The URL preview use case is somewhat murky. If the user is posting a single link and expecting it to generate a preview this very much isn't crawling. It is just fetching based on a specific user request. However if the user posts a long message with multiple links this is approaching crawling that message for links to discover. Overall I think this "URL preview on social media" probably shouldn't follow robots.txt but it isn't clear to me.
This is just a problem of sharing information in band instead of out of band. The OpenGraph metadata is in band with the page content that doesn't need to be shared with OpenGraph bots. The way to separate the usage is to separate the content and metadata with some specific query using `content-type` or `HEAD` or something, then bots are free to fetch that (useless for AI bots) and you can freely forbid all bots from the actual content. Then you don't really need much of a legal framework.
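A toy sketch of that separation (the X-Purpose header and the Flask route are invented for illustration; no standard defines any of this today):

    from flask import Flask, request

    app = Flask(__name__)

    # Hypothetical: preview bots ask for metadata only; everyone else gets the article.
    OG_STUB = (
        '<html><head>'
        '<meta property="og:title" content="Example post">'
        '<meta property="og:description" content="Short summary for link previews">'
        '<meta property="og:image" content="https://example.com/cover.png">'
        '</head><body></body></html>'
    )

    @app.route("/posts/<slug>")
    def post(slug):
        if request.headers.get("X-Purpose") == "preview":  # made-up header
            return OG_STUB
        return f"<html><body>Full article body for {slug}</body></html>"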
I like the idea of using HEAD or OPTIONS methods and have all bots access that so that they get a high level idea of what's going on, without the access to actual content, if the owner decided to block it.
I do like your suggestion of creating some standard that categorizes using function or purpose like you mention. This could simplify things granted that there is a way to validate the function and for spoofing to be hard to achieve. And yes - there is also legal.
I do think that I will likely need to go back and unblock a couple of other bots for this exact reason - so that it would be possible to share it and have previews in other social media. I like to take a slow and thoughtful approach to allowing this traffic as I get to learn what it is that I want and do not want.
Comments here have been a great resource to learn more about this issue and see what other people value.
I try to stay away from negative takes here, so I’ll keep this as constructive as I can:
It’s surprising to see the author frame what seems like a basic consequence of their actions as some kind of profound realization. I get that personal growth stories can be valuable, but this one reads more like a confession of obliviousness than a reflection with insight.
It's mostly that they didn't think of the page-preview fetcher as a "crawler", and did not intend for their robots.txt to block it. It may not be profound, but it's at least not a completely trivial realisation. And heck, an actual human-written blog post can arguably improve the average quality of the web.
The bots are called "crawlers" and "spiders", which to me evokes the image of tiny little things moving rapidly and mechanically from one place to another, leaving no niche unexplored. Spiders exploring a vast web.
Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.
It'd be like telling someone "I spent part of the last year travelling." and when they ask you where you went, you tell them you commuted to-and-fro your workplace five times a week. That's technically travelling, although the other person would naturally expect you to talk about a vacation or a work trip or something to that effect.
> Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.
It’s definitely not crawling as robots.txt defines the term:
> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.
You will see that reflected in lots of software that respects robots.txt. For instance, if you fetch a URL with wget, then it won’t look at robots.txt. But if you mirror a site with wget, then it will fetch the initial URL, then it will find the links in that page, then before fetching subsequent pages it will fetch and check robots.txt.
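Concretely (example.com is a placeholder):

    # Single fetch: robots.txt is never consulted
    wget https://example.com/page.html

    # Recursive mirror: wget fetches robots.txt first and honors its rules
    wget --mirror https://example.com/

    # The impolite override exists too
    wget --mirror -e robots=off https://example.com/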
That's exactly it. It was one of those unintended consequences of blocking everything that led me down the road of figuring it out.
Like other commenters have indicated, I will likely need to go back and allow some other social media to access the OPG data for previews to render properly. But since I mostly post on LinkedIn and HN, I don't feel the need to go and allow all of the other media at the moment. That might change in the future.
I mean it was a realization for me, although I wouldn't call it profound. To your point, it was closer to obliviousness, which led me to learn more about Open Graph Protocol details and how Robots Exclusion Protocol works.
I try to write about things that I learn or find interesting. Sharing it here in the hopes that others might find it interesting too.
I agree, and I am also confused on how this got on the frontpage of all things.
It's like reading a news article saying "water is wet".
You block things -> of course good actors will respect it and avoid you -> of course bad actors will just ignore it, since it's a "please do not do this" notice, not a firewall blocking things.
Honestly, I am also surprised how this got on the frontpage. This was supposed to be a small post of what I have learnt in the process of fixing my LinkedIn previews. I don't know how we got here.
Another common unintended consequence I've seen is conflating crawling and indexing with regards to robots.txt.
If you make a new page and never want it to enter the Google search index, adding it to robots.txt is fine, Google will never crawl it and it will never enter the index.
If you have an existing page that is currently indexed and want to remove it, adding that page to robots.txt is a BAD idea though. In the short term Google will continue to show the page in search results, but show it with no metadata (because it can't crawl it anymore). Even worse, Google won't pick up any noindex tags on the page, because robots.txt is blocking the page from being crawled!
Eventually Google will get the hint and remove the page from the index, but it can be a very frustrating time waiting for that to happen.
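In other words, to get an already-indexed page dropped, you want Google to be able to crawl it and see a noindex, either in the page markup or as a response header:

    <!-- on the page you want removed: allow crawling, forbid indexing -->
    <meta name="robots" content="noindex">

    # or, for non-HTML resources, as an HTTP response header
    X-Robots-Tag: noindex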
There are cases where Google might find a URL blocked in robots.txt (through external or internal links), and the page can still be indexed and show up in the search results, even if they can't crawl it. [1].
The only way to be sure that it will stay out of the results is to use a noindex tag. Which, as you mentioned, search engine bots need to "read" in the code. If the URL is blocked, the "noindex" cannot be read.
It is an interesting tidbit. I personally don't need Google to remove it from the index; it is more of a "I don't care if they index it." I mostly care about the scraping, not the indexing. I do understand that these terms could be used interchangeably, and in the past I might have conflated them.
You consider this about the Linkedin site but don't stop to think about other social networks. This is true about basically all of them. You may not post on Facebook, Bluesky, etc, but other people may like your links and post them there.
> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.
Kinda... it's technically true that you can rank in Google even if you block them in robots.txt, but it's going to take a lot more work. Also, your listing will look worse (last time I saw this there was no site description, but that was a few years back). If you care about Google SEO traffic, you may want to let them onto your site.
1) I only considered LinkedIn alone since I have been posting there and here on HN, and that's it. I figured I will let it play out until I need to allow for more bots to access it. Your suggestion of other people wanting to share the links to the blog is a very valid one that I haven't thought about. I might need to allow several other platforms.
2) With Google and other search engines I have seen a trend towards the AI summaries. It seems like this is the new meta for search engines. And with that I believe it will reduce the organic traffic from those engines to the websites. So, I do not particularly feel that this is a loss for me.
I might eat my words in the future, but right now I think that social media and HN sharing is what will drive the most meaningful traffic to my blog. It is word-of-mouth marketing, which I think is a lot more powerful than someone finding my blog in a Google search.
I will definitely need to go back and do some more research on this topic to make sure that I'm not shooting myself in the foot with these decisions. Comments here have been very helpful in considering other opinions and options.
Point 2 is true to an extent, assuming you aren't monetizing your traffic though, wouldn't earning citations still be more valuable than not showing up in Google at all?
You should also consider that a large proportion of search is purely navigational. What if someone is trying to find your blog and is typing 'evgenii pendragon'. AI summaries don't show for this kind of search.
Currently I can see your site is still indexed (disallowing Google won't remove a page from the index if it's already been indexed) and so you show in search results, but because you block Google in robots.txt it can't crawl content and so shows an awkward 'No information is available for this page.' message. If you continue to block Google in robots.txt eventually the links to your site will disappear entirely.
Even if you really don't want to let AI summarize your content, I would at least allow crawlers to your homepage.
That is what I first thought of when I started seeing more comments noting that it might be a good idea to allow for search engine traffic. All of these suggestions lead me to think about it some more. I definitely will need to understand the domain of SEO and search engine indexing a little deeper if I want to make an educated decision about it and not shoot myself in the foot in the long run.
I think we are seeing the death of what was left of the open web, as people react to inconsiderate crawling for uses (AI) they are not sympathetic with by deciding that trying to ban all automated access is the way to go. :(
The result will be that giant corporations and those with bad intent will still find a way to access what they need, but small hobby, citizen, and civil-society efforts will be blocked out.
I believe that this might give birth to new tools, protocols, and tech which will enable the next evolution of the Open Web into something akin to Protected Open Web.
I very much dislike the invasive scraping approaches. If something were to be done about it, it would result in a new way for clients to interact with resources on the web.
I think a paragraph could have been enough to describe the issue.
My goal with this post was to describe my personal experience with the problem, research, and the solution - the overall journey. I also wanted to write it in a manner that a non-technical person would be able to follow. Hence, being more explicit in my elaborations.
Weirdly, this is something that Apple actually gets right - the little "previews" you get when sharing links in iMessage get generated client-side, _by the sender_.
There are good reasons why you’d not want to rely on clients providing this information when posting to LinkedIn (scams, phishing, etc); but it’s interesting to see an entirely different approach to the problem used here.
I came here to write that I expected that clients should generate previews after receiving a link inside a message. I also expected that somebody else would have already pointed that out and here we are.
However I also understand that there are a number of reasons for a server to scrape the link. In no particular order:
1. scraping all the things, they might be useful sometimes in the future.
2. the page changes, goes 404, the client is reset and loses its db and can't rebuild the preview, but the client can rely on the server for that
3. it's faster for the client as the preview comes with the message and it does not have to issue some extra calls and parse a possibly slow page.
Anyway, you write that it's the sender that generates the preview in iMessage, so that leaves point #1 and possibly the part of #2 about flaky internet connections: the server is in a better place to keep retrying to generate the preview.
A curious detail I noticed: the author's account is a bit more than a month old, yet he has managed to publish four of his own articles (and nothing more, obviously). The other three have zero comments, and two of those three are dead.
Yet, this post of his (posting his own work) gained traction. I believe for robots.txt topic rather than the article itself.
That shows that even if you ignore all the rules of keeping a healthy community (not publish your self promotion only), eventually you’d get traction and nobody would care, I guess.
Quick edit: my bad, wrong click brought me to the wrong location. So I made a bit wrong assumption. The author posted 4 extra posts alongside his own, so it’s not 100% of self-promotion, but 50%.
To be honest, I also think that the discussion has become more interesting than what I wrote in the article. I have learned quite a few things for myself. And on top of that it has been great to hear some feedback in regards to some thoughts that I wrote.
There are more articles available in my blog than I have shared here. I don't think that everything that I write is shareworthy on HN. There are some that I find to be more interesting. Those are the ones I end up sharing.
As you have noticed, I try to share other interesting resources that I find online too. Is there a ratio of self/other content that would help keep a healthy community?
I don't consider HN a healthy community, so I believe this approach of self-promotion is the only sane way to actually deal with this website. So, no real judgement from me. I just find it curious that apparently some mod banned you, but this time it was too late, since the discussion became somewhat interesting.
If it were some other website, I'd say that if you post many of your own pieces yourself, many community members won't be happy. I'd say go with not 50% but, say, 20% or even 10%. Again, in my personal opinion this website is long beyond repair (if it ever was in good shape), so feel free to do whatever you want, unless the whims of the mods get you banned for no real reason one day.
Yeah, I don't even think I 100% understand how the algorithm and moderation works here.
I will consider bringing in more content that is not my own. I did read the guidelines for HN, and I saw that they encourage members to share what they find, and occasionally their own stuff. Appreciate your opinion!
The problem isn't the robots that do follow robots.txt; it's all the bots that don't. Robots.txt is largely irrelevant now because the bots that follow it don't represent most of the traffic problem. They certainly don't include the bots that are going to hammer your site without any regard; those bots don't follow robots.txt.
The argument is basically to have the bots that decide to ignore your robots.txt (or any bot, if you so desire) scrape your website indefinitely, wasting their resources.
While this may work today, bots increasingly use, if not full headless rendering, then at least CSS rules to avoid fetching invisible content.
I wonder whether the path in the robots.txt (or maybe a <link> tag with a bogus rel attribute) would already be enough to make evil bots follow it. That would at least avoid accidental follows due to CSS/ARIA not working properly in odd configurations.
I would hope that all screen readers would respect display:none. The aria-hidden is for CYA, banning even one blind user would be quite bad optics (as is this sentence now that I think about it).
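If anyone wants to experiment with that kind of trap, a minimal sketch might look like the following (Python/Flask, with a made-up /trap/ path that would be listed as Disallow in robots.txt and linked only from a hidden anchor); anything that requests it gets banned. This is a sketch of the general idea, not a hardened implementation:

```python
# Hypothetical honeypot: /trap/ is disallowed in robots.txt and only linked
# from a hidden anchor, so humans and polite crawlers should never hit it.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # in real life you'd persist this or feed it to a firewall

@app.before_request
def block_banned_clients():
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/trap/")
def honeypot():
    # Only a crawler that ignores robots.txt and follows hidden links lands here.
    banned_ips.add(request.remote_addr)
    abort(403)

@app.route("/")
def index():
    # The bait: a link humans never see and well-behaved crawlers never follow.
    return '<a href="/trap/" style="display:none" aria-hidden="true">ignore me</a>'
```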
Not sure why you were downvoted. I have zero confidence that OpenAI, Anthropic, and the rest respect robots.txt however much they insist they do. It's also clear that they're laundering their traffic through residential ISP IP addresses to make detection harder. There are plenty of third-parties advertising the service, and farming it out affords the AI companies some degree of plausible deniability.
One way to test if the model respects robots.txt could be to ask the model if it can scrape a given URL. One thing that it doesn't address, though, is the scraping of the training data for the model. That area feels more like the wild west.
Nobody has any confidence that AI crawlers won't DDoS you. That's why there have been dozens of posts about how bandwidth is becoming an issue for many websites as bots continuously crawl their sites for new info.
Wikipedia publishes full dumps for this reason. AI companies ignore the readily available dumps and instead crawl every page hundreds of times a day.
The "full solution" to this, of course, is micropayments. A bot which has to pay a tenth of a cent every time it visits one of your pages (or requests something that 404s) will quickly rack up a $10 bill crawling a whole 10,000-page site. If it tries to do that every day, or every hour, that's an excellent payday for you and a very compelling reason for almost all bots to blacklist your domain name.
A human being who stops by to spend 20 minutes reading your blog once won't even notice they've spent 1.2 cents leafing through. This technology has existed for a while, and yet very few people have found it a good idea to adopt. There is probably a good reason for that.
The realistic solution is to probably just do some statistics and figure out who's getting on your nerves, and then ban them from your digital abode. Annoying, but people go a lot farther to protect their actual homes if they happen to live in high crime areas.
What astounds me is there are no readily available libraries crawler authors can reach for to parse robots.txt and meta robots tags, to decide what is allowed, and to work through the arcane and poorly documented priorities between the two robots lists, including what to do when they disagree, which they often do.
Yes, there's an ancient google reference parser in C++11 (which is undoubtedly handy for that one guy who is writing crawlers in C++), but not a lot for the much more prevalent Python and JavaScript crawler writers who just want to check if a path is ok or not.
Even if bot writers WANT to be good, it's much harder than it should be, particularly when lots of the robots info isn't even in the robots.txt files, it's in the index.html meta tags.
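To be fair, Python does ship a bare-bones parser in the standard library; it only understands robots.txt itself, not the meta robots tags mentioned above, which is exactly the gap being complained about. A minimal check might look like this (site URL and crawler name are placeholders):

```python
# Minimal robots.txt check with the standard library. Handles only robots.txt,
# not <meta name="robots"> tags, so it's a partial answer at best.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

# May a hypothetical crawler fetch this path?
print(rp.can_fetch("ExampleCrawler/1.0", "https://example.com/some/path"))

# Crawl-delay for that user agent, if the site declares one (None otherwise).
print(rp.crawl_delay("ExampleCrawler/1.0"))
```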
rel=nofollow is a bad name. It doesn’t actually forbid following the link and doesn’t serve the same purpose as robots.txt.
The problem it was trying to solve was that spammers would add links to their site anywhere that they could, and this would be treated by Google as the page the links were on endorsing the page they linked to as relevant content. rel=nofollow basically means “we do not endorse this link”. The specification makes this more clear:
> By adding rel="nofollow" to a hyperlink, a page indicates that the destination of that hyperlink should not be afforded any additional weight or ranking by user agents which perform link analysis upon web pages (e.g. search engines).
> nofollow is a bad name […] does not mean the same as robots exclusion standards
The "good" bot writers rarely have enough resources to demolish servers blindly, and are generally more careful whether or not you make it easier, so there's not much incentive.
> If you don't want people to crawl your content, don't put it online.
I sometimes put things online for specific people to view/use, or for my own purposes. That gets an "all crawlers can do one" robots.txt and sometimes a script or two to try to waste a little of the time of those that don't play ball.
It is online because I want it online, not for some random to hoover up.
I consider robots.txt as a garden gate. I know it isn't secure, but likewise someone peering directly into my back bedroom window knows just as well that I don't want them there.
I could put stuff like that behind authentication, but that is hassle for both me and the people who I want to be accessing the stuff. I usually use not-randomly-guessable URIs though sometimes that is inconvenient too, and anyway they do sometimes get “guessed”. I must have at least one friend-of-friend who has an active infestation which is reading their messages or browser history for things to probe because the traffic pattern isn't just preview generation, I've had known AI crawlers pass by some things repeatedly.
TBH I don't really care that much, much at all in fact, I just don't like the presumption that my stuff is free for commercial exploitation.
robots.txt seems to be an irresistible attractor for some, most recently in a crusade against all kinds of GenAI.
I get not wanting to have our data serve as training data, but I've also seen moderately large newspapers throwing literally all LLM bots in there, i.e. not only those that scrape training data, but also those that process users' search requests or even direct article fetches.
The obvious, but possibly not expected, result was that this newspaper became effectively invisible to user searches in ChatGPT. Not sure if I'm an outlier here, but I personally click through to many of the sources ChatGPT provides, so they must be losing tons of traffic that way.
Having worked on bot detection in the past: some really simple, old-fashioned attacks happened by doing the opposite of what the robots.txt file says.
While I doubt it does much today, that file really only matters to those that want to play by the rules, which, on the free web, is not an awful lot of the web anymore, I'm afraid.
That was the first thing that I learnt about the robots.txt file. Even RFC 9309, the Robots Exclusion Protocol document (https://www.rfc-editor.org/rfc/rfc9309.html), mentions:
> These rules are not a form of access authorization.
Meaning that these are not enforced in any way. They cannot prevent you from accessing anything really.
I think the only approach that could work in this scenario would be to find out which companies disregard robots.txt and bring it to the attention of the technical community. Practices like these could make a company look shady and untrustworthy if found out. That could be one way to keep them accountable, even though there is still no guarantee they will abide by it.
This is a kind of scream test, even if self-inflicted. Scream tests are usually a good way to discover actual usage in complex (or not so complex) systems.
> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.
As far as I remember from google search console, a disallow directive in robots.txt causes google not only to avoid crawling the page, but also to eventually remove the page from its index. It certainly shouldn't add any more pages to its index, external references or not.
Thank you for that note! I didn't know that. It is something I will need to figure out: either I'm ok with not being in the search engine OR I will update the robots.txt. Currently I'm relying on the social media traffic and the word-of-mouth marketing.
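For reference, if the goal ends up being "stay visible to search engines but opt out of AI training crawlers", a robots.txt along these lines is one option (GPTBot and CCBot are the publicly documented tokens for OpenAI's and Common Crawl's crawlers; whether every bot honors them is a separate question):

```
# Let traditional search engines in
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Keep documented AI training crawlers out
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: allowed by default
User-agent: *
Allow: /
```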
This reminds me of an old friend of mine who wrote long revelation posts on how he started using the "private" keyword in C++ after the compiler helped him find why a class member changed unexpectedly, and how he no longer drives a car with the clutch half-pressed because it burns the clutch.
If you are hosting a house party that invites the entire world, robots.txt is a neon sign to guide guests to where the beers are, who's cooking what kind of burgers and on what grill, the rules of the house, etc. You'll still have to secure your gold chains and laptop in a safe somewhere, or decide whether to even keep them in the same house at all.
Gold chains etc. should be behind authentication. Robots.txt is more like a warning sign that says the hedge maze in the garden goes on forever, so probably stay out of it.
This doesn't seem like a new discovery at all - this is what news publications have been dealing with ever since they went online.
You aren't going to get advertising without also providing value - be that money or information. Google has over 2 trillion in capitalization based primarily on the idea of charging people to get additional exposure, beyond what the information on their site otherwise would get.
I believe that as search engines continue to move toward AI summaries and responses, it will reduce the traffic to the websites since most people will be ok accepting the answers that the AI gave them.
My approach right now is to rely on social media traffic primarily where you can engage with the readers and build trust with the audience. I don't plan on using any advertising in the near future. While that might change, I am convinced that more intentional referral traffic will generate more intentional engagement.
LinkedIn is by far the worst offender in post previews. The doctype tag must be all lowercase. The HTML document must be well-formed (the meta tags must be in an explicit <head> block, for example). You must have OG meta tags for url, title, type, and image. The url meta tag gets visited, even if it's the same address the inspector is already looking at.
Fortunately, the post inspector helps you suss out what's missing in some cases, but c'mon, man, how much effort should I spend helping a social media site figure out how to render a preview? Once you get it right, and to quote my 13 year old: "We have arrived, father... but at what cost?"
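If it helps anyone else fighting the inspector, here is a rough sketch of a self-check, assuming all you want to verify is that the four OG tags mentioned above are present (standard library only; real previews also care about the doctype, well-formedness, image size, and so on):

```python
# Quick-and-dirty check that a page declares the OG tags LinkedIn wants.
# Treat this as a starting point, not a full preview validator.
from html.parser import HTMLParser
from urllib.request import urlopen

REQUIRED = {"og:url", "og:title", "og:type", "og:image"}

class OGCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("property") in REQUIRED:
            self.found[attrs["property"]] = attrs.get("content")

url = "https://example.com/some-post"  # placeholder
html = urlopen(url).read().decode("utf-8", errors="replace")
collector = OGCollector()
collector.feed(html)

missing = REQUIRED - collector.found.keys()
print("missing OG tags:", missing or "none")
```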
Worst offenders I come across: official government information that needs to be public, placed behind Cloudflare, preventing even their M2M feeds (RSS, Atom, ...) from being accessed.
No logins, but browser fingerprinting, behavior tracking, straight blocking, captchas and other 'verification', even creating bogus honeypots and serving up unrelated generated spam instead of real page content.
Maybe he is talking about stuff you're required by law to disclose but you don't really want to be seen too much. Like code of conduct, terms of service, retractions or public apologies.
Yes, there's often not much reason to block bots that abide by the rules. It just makes your site not show up on other search indexes and introduces problems for users. Malicious bots won't care about your robots.txt anyway.
This is kinda amusing.
robots.txt main purpose back in the day was curtailing penalties in the search engines when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."
It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.
Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.
That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."
It should be noted here that the Evil Bit proposal was an April Fools RFC https://datatracker.ietf.org/doc/html/rfc3514
While we're at it, it should be noted that Do Not Track was not, apparently, a joke.
It's the same as a noreply email, if you can get away with sticking your fingers in your ears and humming when someone is telling you something you don't want to hear, and you have a computer to hide behind, then it's all good.
There should be a law against displaying a cookie consent box to a user who has their Do Not Track header set.
Not all that far-fetched, Global Privacy Control is legally binding in California.
https://en.wikipedia.org/wiki/Global_Privacy_Control
https://news.ycombinator.com/item?id=43377867
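Mechanically, honoring either signal would be trivial; the sketch below (a hypothetical Flask handler) just branches on the DNT and Sec-GPC request headers. The hard part has always been the legal and incentive side, not the code:

```python
# Sketch: treat DNT: 1 or Sec-GPC: 1 as "user declined tracking" and skip both
# the tracking and the consent banner. The headers are real; whether anyone is
# obliged to honor them is the whole debate above.
from flask import Flask, request

app = Flask(__name__)

def user_opted_out() -> bool:
    # DNT is the old Do-Not-Track header, Sec-GPC is Global Privacy Control.
    return request.headers.get("DNT") == "1" or request.headers.get("Sec-GPC") == "1"

@app.route("/")
def index():
    if user_opted_out():
        # No tracking scripts, no consent banner.
        return "<p>Content, tracker-free.</p>"
    # Otherwise fall back to whatever consent flow the site normally uses.
    return "<p>Content plus the usual cookie banner.</p>"
```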
How is "Do Not Track" a joke, but a website presenting a "Do not use cookies" button is not? What's the difference?
It is ridiculous, but it is what you get when you have conflicting interests and broken legislation. The rule is that tracking has to be opt-in, so websites do it the way they are more likely to get people to opt in, and it is a cookie banner before you access the content.
Do-not-track is opt-out, not opt-in, and in fact, it is not opt-anything since browsers started to set it to "1" by default without asking. There is no law forcing advertisers to honor that.
I guess it could work the other way: if you set do-not-track to 0 (meaning "do-track"), which no browser does by default, make cookies auto-accept and do not show the banner. But then the law says that it should require no more actions to refuse consent than to consent (to counter those ridiculous "accept or uncheck 100 boxes" popups), so it would mean they would also have to honor do-not-track=1, which they don't want to.
I don't know how legislation could be unbroken. Users don't want ads, don't want tracking, they just want the service they ask for and don't want to pay for it. Service providers want exactly the opposite. Also people need services and services need users. There is no solution that will satisfy everyone.
> since browsers started to set it to "1" by default without asking
IIRC IE10 did that, to much outcry because it upended the whole DNT being an explicit choice; no other browser (including Edge) set it as a default.
There have been thoughts about using DNT (the technical communication mechanism about consent/objection) in correlation with GDPR (the legal framework to enforce consent/objection compliance)
https://www.w3.org/blog/2018/do-not-track-and-the-gdpr/
The GDPR explicitly mentions objection via technical means:
> In the context of the use of information society services, and notwithstanding Directive 2002/58/EC, the data subject may exercise his or her right to object by automated means using technical specifications.
https://law.stackexchange.com/a/90002
People like to debate whether DNT itself has enough meaning:
> Due to the confusion about this header's meaning, it has effectively failed.
https://law.stackexchange.com/a/90004
I myself consider DNT as what it means at face value: I do not want to be tracked, by anyone, ever. I don't know what's "confusing" about that.
The only ones that are "confused" are the ones it would be detrimental to, i.e. the ones that perform and extract value from the tracking, and who make people run in circles with contrived explanations.
It would be perfectly trivial for a browser to pop up a permission request per website like there is for webcams or microphone or notifications, and show no popup should I elect to blanket deny through global setting.
Labor laws are not set to satisfy everyone; they are set such that a company cannot use its outsized power to exploit its workers, and so that workers have a fair chance at negotiating a fair deal despite holding less power.
Similarly, consumer protection laws (which the cookie banner rules are) are not set to satisfy everyone; they are set such that companies cannot use their outsized power to exploit their customers. A good consumer protection law will simply ban harmful behavior regardless of whether the companies that engage in said harmful behavior are satisfied with that ban or not. A good consumer protection law will satisfy the user (or rather the general public), but it may not satisfy the companies.
Good consumer protection laws are things like disclosure requirements or anti-tying rules that address information asymmetries or enable rather than restrict customer choice.
Bad consumer protection laws try to pretend that trade offs don't exist. You don't want to see ads, that's fine, but now you either need to self-host that thing or pay someone else money to do it because they're no longer getting money from ads.
There is no point in having an opt in for tracking. If the user can be deprived of something for not opting in (i.e. you can't use the service) then it's useless, and if they can't then the number of people who would purposely opt in is entirely negligible and you ought to stop beating around the bush and do a tracking ban. But don't pretend that's not going to mean less "free stuff".
The problem is legislators are self-serving. They want to be seen doing something without actually forcing the trade off that would annihilate all of these companies, so instead they implement something compromised to claim they've done something even though they haven't actually done any good. Hence obnoxious cookie banners.
That whole argument assumes that you as a consumer can always find a product with exactly the features you want. Because that's a laughable fiction, there need to be laws with teeth to punish bad behaviors that nearly every product would indulge in otherwise. That means things like requiring sites to get permission to track, and punishing those that track users without permission. It's a good policy in theory, but it needs to be paired with good enforcement, and that's where things are currently lacking.
> That whole argument assumes that you as a consumer can always find a product with exactly the features you want. Because that's a laughable fiction
There are very many industries where this is exactly what happens. If you want a stack of lumber or a bag of oranges, it's a fungible commodity and there is no seller who can prevent you from buying the same thing from someone else if you don't like their terms.
If this is ever not the case, the thing you should be addressing is that, instead of trying to coerce an oligopoly that shouldn't exist into behaving under the threat of government penalties rather than competitive pressure. Because an uncompetitive market can screw you in ten thousand different ways regardless of whether you've made a dozen of them illegal.
> That means things like requiring sites to get permission to track, and punishing those that track users without permission. It's a good policy in theory, but it needs to be paired with good enforcement, and that's where things are currently lacking.
It's not a good policy in theory because the theory is ridiculous. If you have to consent to being tracked in exchange for nothing, nobody is going to do that. If you want a ban on tracking then call it what it is instead of trying to pretend that it isn't a ban on the "free services in exchange for tracking data" business model.
I think you might be misunderstanding the purpose of consumer protection. It is not about consumer choice; rather, it is about protecting consumers from the inherent power imbalance that exists between a company and its customers. If there is no way of providing a service for free without harming the customers, this service should be regulated such that no vendor is able to provide it for free. It may seem punishing for the customers, but it is not. It protects the general public from this harmful behavior.
I actually agree with you that cookie banners are a bad policy, but for a different reason. As I understand it there are already requirements that the same service should also be available to opt-out users, however as your parent noted, enforcement is an issue. I, however, think that tracking users is extremely consumer hostile, and I think a much better policy would be a simple ban on targeted advertising.
> I think you might be misunderstanding the purpose of consumer protection. It is not about consumer choice, but rather it is about protecting consumer from the inherent power imbalance that exists between the company and their customers.
There isn't an inherent power imbalance that exists between the company and their customers, when there is consumer choice. Which is why regulations that restrict rather than expand consumer choice are ill-conceived.
> If there is no way to doing a service for free without harming the customers, this service should be regulated such that no vendor is able to provide this service for free.
But that isn't what those regulations do, because legislators want to pretend to do something while not actually forcing the trade off inherent in really doing the thing they're only pretending to do.
> I, however, think that tracking users is extremely consumer hostile, and I think a much better policy would be a simple ban on targeted advertising.
Which is a misunderstanding of the problem.
What's actually happening in these markets is that we a) have laws that create a strong network effect (e.g. adversarial interoperability is constrained rather than required) which means that b) the largest networks win, and the networks available for free then becomes the largest.
Which in turn means you don't have a choice, because Facebook is tracking everyone but everybody else is using Facebook, which means you're stuck using Facebook.
If you ban the tracking while leaving Facebook as the incumbent, two things happen. First, those laws are extremely difficult to enforce because neither you nor the government can easily tell what they do with the information they inherently get from the use of a centralized service, so they aren't effective. And second, they come up with some other business model -- which will still be abusive because they still have market power from the network effect -- and then get to blame the new cash extraction scheme on the law.
Whereas if you do what you ought to do and facilitate adversarial interoperability, that still sinks their business model, because then people are accessing everything via user agents that block tracking and ads, but it does it while also breaking their network effect by opening up the networks so they can't use their market power to swap in some new abusive business model.
For one, Do Not Track is on the client side and you just hope and pray that the server honors it, whereas cookie consent modals are something built by and placed in the website.
I think you can reasonably assume that if a website went through the trouble of making such a modal (for legal compliance reasons), the functionality works (also for legal compliance reasons). And, you as the client can verify whether it works, and can choose not to store them regardless.
> And, you as the client can verify whether it works
How do you do that? Cookies are typically opaque (encrypted or hashed) bags of bits.
Just the presence or absence of the cookie.
I would assume most websites would still set cookies even if you reject consent, because the consent is only about cookies that aren't technically necessary. Just because the website sets cookies doesn't tell you whether it respects your selection. Only if it doesn't set any cookies can you be sure, and I would assume that's a small minority of websites.
The goal with Do Not Track was legal (get governments to recognize it as the user declining consent for tracking and forbidding additional pop-ups) and not technological.
Unfortunately, the legal part of it failed, even in the EU.
Do Not Track had a chance to get into law; if it had, it would have been good that the code and standard were already in place.
I like the 128 bit strength indicator for how "evil" something is.
So it did the same work that a sitemap does? Interesting.
Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.
> I didn’t realize its original purpose was to manage duplicate content penalties though.
That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the RFC from 1994 says:
> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
— https://www.robotstxt.org/orig.html
> It was mostly about stopping crawlers from unnecessarily consuming server resources.
Very much so.
Computation was still expensive, and http servers were bad at running cgi scripts (particularly compared to the streamlined amazing things they can be today).
SEO considerations came way way later.
They were also used, and still are, by sites that have good reasons to not want results in search engines. Lots of court files and transcripts, for instance, are hidden behind robots.txt.
> Computation was still expensive
I think this is still relevant today in cases where there are not many resources available: think free tiers, smallest fixed cost/fixed allocation scenarios, etc.
Robots.txt was created long before Google and before people were thinking about SEO:
https://en.wikipedia.org/wiki/Robots.txt
The scenario I remember was that the underfunded math department had an underpowered server connected via a wide and short pipe to the overfunded CS department and webcrawler experiments would crash the math department's web site repeatedly.
With the advent of AI, and the notion of actually going to a website becoming quaint, each website should have a humans.txt, such as https://www.netflix.com/humans.txt or https://www.google.com/humans.txt
I have not heard of humans.txt before. It is apparently used for acknowledgement and crediting the dev team who created the resource.
What everybody is missing is that AI inference (not training) is a route out of the enshittification economy. One reason why Cloudflare is harassing you all the time to click on traffic lights and motorcycles is to slam the door from some of the exit routes.
It is so interesting to track this technology's origin back to the source. It makes sense that it would come from a background of limited resources where things would break if you overwhelm it. It didn't take much to do so.
Yup. Robots.txt was a don't-swamp-me thing.
> And you wouldn't expect the bad bots to play nicely just because you asked them.
Well, yes, the point is to tell the bots what you've decided to consider "bad" and will ban them for. So that they can avoid doing that.
Which of course only works to the degree that they're basically honest about who they are or at least incompetent at disguising themselves.
I think it depends on the definition of bad.
I always consider "good" a bot that doesn't disguise itself and follows the robots.txt rules. I may not consider good the final intent of the bot or the company behind it, but the crawler behaviour is fundamentally good.
Especially considering the fact that it is super easy to disguise a crawler and not follow the robots conventions
Well you as the person running a website can define unilaterally what you consider good and bad. You may want bots to crawl everything, nothing or (most likely) something inbetween. Then you judge bots based on those guidelines. You know like a solicitor that rings your bell that has a text above it saying "No solicitors", certain assumptions can be made about those who ignore it.
Some people just believe that because someone says so, everyone will nicely obey and follow the rules. I don't know, maybe it is a cultural thing.
Or a positive belief in human nature.
I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.
But I kinda like having this attitude and expectation. Makes me feel healthier.
I deeply agree with you, and I'd like to add:
Trust by default, also by default, never ignoring suspicious signals.
Trust is not the same as being naïve; I find the conflation of the two very worrying.
> Trust by default, also by default, never ignoring suspicious signals.
While I absolutely love the intent of this idea, it quickly falls apart when you're dealing with systems where you only get the signals after you've already lost everything of value.
You don't have to go as far as to straight up "trust by default". You can instead "give a chance" by default, which is the middle path.
Actually, Veritasium has a great video about this. That kind of strategy has been shown to be the most effective in Monte Carlo simulations.
EDIT: This one: https://youtu.be/mScpHTIi-kM
I like that Veritasium vid a lot; I've watched it a couple of times. The thing is, there's no way to retaliate against a crawler ignoring robots.txt. IP bans don't work, user agent bans don't work, and there's no human to shame on social media either. If there's no way to retaliate or provide some kind of meaningful negative feedback, then the whole thing breaks down. Back to the Veritasium video: if a crawler defects, it reaps the reward, but there's no way for the content provider to defect, so the crawler defects 100% of the time and gets 100% of the defection points. I can't remember when I first read the spec for robots.txt, but I do remember finding it strange that it was a "pretty please" request against a crawler that has a financial incentive to crawl as much as it can. Why even go through the effort to type it out?
EDIT: I thought about it for a minute. I think in the olden days a crawler crawling every path through a website could yield an inferior search index, so robots.txt gave search engines a hint about what content was valuable to index. The content provider gained because their SEO was better (and CPU utilization lower), and the search engine gained because their index was better. So there was an advantage to cooperation then, but with crawlers feeding LLMs that isn't the case.
No robots.txt can't fix this.
Have you tried Anubis? It was all over the internet a few months ago. I wonder if it actually works well. https://github.com/TecharoHQ/anubis
This is a really cool tool. I haven't seen it before. Thank you for sharing it!
On their README.md they state:
> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.
I love the idea!
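For anyone curious how that kind of protection works in broad strokes: the client has to burn a bit of CPU on a proof-of-work challenge before it gets content, which is cheap for one human but expensive at crawler scale. The sketch below is not Anubis's actual protocol, just the general idea behind it, with a made-up challenge format and difficulty:

```python
# Conceptual proof-of-work sketch: find a nonce whose SHA-256 hash, combined
# with a server-issued challenge, starts with N zero hex digits.
import hashlib
import os

DIFFICULTY = 4  # leading zero hex digits required (made up for illustration)

def make_challenge() -> str:
    return os.urandom(16).hex()

def verify(challenge: str, nonce: int) -> bool:
    """What the server does: one cheap hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

def solve(challenge: str) -> int:
    """What the client-side JavaScript would do: brute-force a nonce."""
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce

challenge = make_challenge()
nonce = solve(challenge)  # roughly 65k hashes on average at difficulty 4
assert verify(challenge, nonce)
```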
It's easy to believe, though, and most of us do it every day. For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.
There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).
In Southern Europe, and in countries with similar cultures, we don't obey rules just because someone says so; we obey them when we see that it is actually reasonable to do so. Hence my remark regarding culture, as I have also experienced living in countries where everyone mostly blindly follows the rules, even if they happen to be nonsense.
Naturally I am talking about cultures where that decision has not been taken away from their citizens.
> I also experienced living in countries where everyone mostly blindly follow the rules, even if they happen to be nonsense.
The problem with that is that most people are not educated enough to judge what makes sense and what doesn’t, and the less educated you are, the more likely you are to believe you know what makes sense when you’re actually wrong. These are exactly the people that should be following the rules blindly, until they actually put in the effort to learn why those rules exist.
I believe there is a difference between education and critical thinking. One may not have a certain level of education, but could exercise a great degree of critical thinking. I think that education can help you understand the context of the problem better. But there are also plenty of people who are not asking the right questions or not asking questions - period - who have lots of education behind them. Ironically, sometimes education is the path that leads to blind trust and lack of challenging the status quo.
> the less educated you are, the more likely you are to believe you know what makes sense
It actually frightens me how true this statement is.
To reinforce my initial position about how important the rules are for setting expectations, I usually use cyclists as an example. Many follow the proposed rules, understanding they are traffic, and right of way is not automagically granted based on the choice of vehicle, having more to do with direction and the flow of said traffic.
But there's always a bad apple, a cyclist who assumes themselves to be exempt from the rules and rides against the flow of traffic, then wonders why they got clipped because a right-turning driver wasn't expecting a vehicle to be coming from the direction traffic is not supposed to come from.
In the end, it's not really about what we drive or how we get around, but whether we are self-aware enough to understand that the rules apply to us, and collectively so. Setting the expectation of what each of our behaviors will be is precisely what creates the safety that comes with following them, and only the dummies seem to be the ones who think they are exempt.
As a Frenchman, being passed on the right by Italian drivers on the highway really makes me feel the superiority of Southern European judgment over my puny habit of blindly following rules. Or does it?
But yes, I do the same. I just do not come here to pretend this is virtue.
The rules in France are probably different but passing on the right is legal on Italian highways, in one circumstance: if one keeps driving on the lane on the right and somebody slower happens to be driving on the lane on the left. The rationale is that it normally happens when traffic is packed, so it's ok even if there is little traffic. Everybody keep driving straight and there is no danger.
It's not legal if somebody is following the slower car on the left and steers to the right to pass. However some drivers stick to the left at a speed slower than the limit and if they don't yield what happens is that eventually they get passed on the right.
The two cases have different names. The normal pass is "sorpasso", the other one (passing by not steering) is "superamento", which is odd but they had to find a word for it.
Not sure if it is a virtue, but standing as a pedestrian in an empty street at 3 AM waiting for a traffic light to turn green doesn't make much sense either; it isn't as if a ghost car is coming out of nowhere.
It should be a matter of judgement and not following rules just because.
It makes sense, as it allows you to walk city streets safely on autopilot while thinking about other things.
I kind of agree. The rules for safety should be simple, straightforward, and protect you in the "edge cases", i.e. following while not paying 100% of attention, protect you with a malicious actor in mind aka reckless driver, etc. Ideally, in a system like that it should be a difficult and intentional behavior if one wanted to break the rules rather than to follow them.
One should not pass any street on “auto-pilot”, no matter if there’s a green light for pedestrians.
I agree. I mostly mean that it is good to strive towards a system of rules that will be easy to follow and difficult to break by default. That is an ideal case. In reality, it is never that simple.
> For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.
That trust comes from the knowledge that it's likely that those drivers also don't want to crash, and would rather prefer to get where they're going.
I love the culturally specific implication that 'commute' == 'commute in the car' :)
I apologize for that. I try to mitigate my US-centricness in my comments as much as possible, understanding completely that I am speaking with a global audience, but I am definitely not perfect at it :D
I suppose the same goes if you take the tube, ride a bike, walk, etc? There's still rules in terms of behavior, flow of traffic (even foot traffic), etc, that helps set a number of expectations so everyone can decide and behave accordingly. Happy to hear different thoughts on this!
I still see the value in robots.txt and DNT as a clear, standardised way of posting a "don't do this" sign that companies could be forced to respect through legal means.
The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.
That was a theory at least...
Would robot traffic be considered tracking in light of GDPR standards? As far as I know there are no regulatory rules in relation to enforcing robots behaviors outside of robots.txt, which is more of an honor system.
DNT and GDPR was just an example. In a court case about tracking, DNT could be found to be a clear and explicit opt-out. Similarly, in a case about excessive scraping or the use of scraped information, robots.txt could be used as a clear and explicit signal that the site operator does not want their pages harvested. It all but certainly gets rid of the "they put it on the public web so we assumed we could scrape it; we can't ask everyone for permission" argument. They can't claim it was "in good faith" if there's a widely accepted standard for opting out.
Fair enough. It should be sufficient to say one way or the other.
> That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.
It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.
The idea behind the DNT header was to back it up with legislation - and sure, you can't catch and prosecute all tracking, but there are limits to the scale of criminal "move fast and break things" you can get away with before someone rats you out. :P
I created a search engine that crawled the web way back in 2003. I used a proper user agent that included my email address. I got SO many angry emails about my crawler, which played as nice as I was able to make it play. Which was pretty nice I believe. If it’s not Google people didn’t want it. That’s a good way to prevent anyone from ever competing with Google. It isn’t just about that preview for LinkedIn, it’s about making sure the web is accessible by everyone and everything that is trying to make its way. Sure, block the malicious ones. But don’t just assume that every bot is malicious by default.
I definitely agree here. My initial response was to block everything, however you realize that web is complex and interdependent. I still believe that everyone should have autonomy over their online resources if they desire. But that comes with an intentionality behind it. If you want to allow or disallow certain traffic, you also should answer the question why or why not. That requires understanding what each bot does. That takes time and effort.
My foray into robots.txt started from the whole notion of AI companies training on everything they can put their hands on. I want to be able to have a say whether I allow it or not. While not all bots will honor the robots.txt file, there are plenty that do. One way that I found you can test that is by asking the model directly to scrape a particular link (assuming the model has browsing capabilities).
Bots are not malicious by default. It is what that company does with your data and how you feel about it that matters in the end.
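For what it's worth, the "good bot" etiquette described above mostly boils down to identifying yourself, honoring robots.txt, and rate-limiting yourself. A rough sketch (the bot name, contact address, target site, and delay are all placeholders):

```python
# Sketch of a polite crawler: descriptive user agent with a contact address,
# a robots.txt check before each fetch, and a self-imposed delay.
import time
import urllib.request
from urllib import robotparser

USER_AGENT = "ExampleCrawler/0.1 (+https://example.com/bot; mailto:ops@example.com)"
DELAY_SECONDS = 5

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

def polite_fetch(url):
    """Fetch a page only if robots.txt allows it, then back off."""
    if not rp.can_fetch(USER_AGENT, url):
        return None  # respect the site's wishes
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    time.sleep(DELAY_SECONDS)  # don't hammer the server
    return body
```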
The most annoying thing about being a good bot owner, in my experience, is when you get complaints about it misbehaving, only to find that it was actually somebody malicious who wrote their own abusive bot, but is using your bot's user agent.
Cloudflare have some new bot verification proposals designed to fix this, with cryptographic proofs that the user-agent is who they say they are: https://blog.cloudflare.com/web-bot-auth/.
It is awesome to see that there are efforts to improve upon the old standards with modern web security standards
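Until something like that is widely deployed, the old-school way to catch user-agent spoofing for bots that publish their DNS, such as Googlebot, is the reverse-then-forward DNS check that the big search engines document. A sketch of that approach:

```python
# Verify a claimed Googlebot by reverse-resolving the client IP, checking the
# domain suffix, then forward-resolving and confirming it maps back to that IP.
# Other well-known bots document their own domains; the idea is the same.
import socket

def verify_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup must map back to the original IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:  # unresolvable address or lookup failure
        return False
```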
That's easy to say when it's your bot, but I've been on the other side enough to know that the problem isn't your bot; it's the 9000 other ones just like it, none of which will deliver traffic anywhere close to the resources consumed by scraping.
True. Major search engines and bots from social networks have a clear value proposition: in exchange for consuming my resources, they help drive human traffic to my site. GPTBot et al. will probably do the same, as more people use AI to replace search.
A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.
An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.
> GPTBot et al. will probably do the same, as more people use AI to replace search.
It really won’t. It will steal your website’s content and regurgitate it back out in a mangled form to any lazy prompt that gets prodded into it. GPT bots are a perfect example of the parasites you speak of that have destroyed any possibility of an open web.
That was my hunch. My initial post on robots.txt: https://evgeniipendragon.com/posts/i-am-disallowing-all-craw... - revolved around blocking AI models from doing that because I do not believe that it will bring more traffic to my website - it will use the content to keep people using their service. I might be proven wrong in the future, but I do not see why they would want to let go of an extra opportunity to increase retention.
Which is all you need a lot of the time. If you're a hotel, or restaurant, or selling a product, or have a blog to share information important to you, then all you need is for the LLM to share it with the user.
"Yes, there are a lot of great restaurants in Chicago that cater to vegans and people who enjoy musical theater. Dancing Dandelions in River North is one." or "One way to handle dogs defecating in your lawn is with Poop-Be-Gone, a non-toxic product that dissolves the poop."
It's not great for people who sell text online (journalists, I guess, who else?). But that's probably not the majority of content.
You are bringing a great point. In some cases having your data as available as possible is the best thing you can do for your business. Letting them crawl and scrape creates means by which your product is found and advertised.
In other cases, like technical writing, you might want to protect the data. There is a danger that your content will be stolen and nothing will be given in return - traffic, money, references, etc.
Only if the GPT companies can resist the temptation of all that advertising $$$.
I'll give them at most 3 years before sponsored links begin appearing in the output and "AI optimization" becomes a fashionable service alongside the SEO snake oil. Most publishers won't care whether their content is mangled or not, as long as it is regurgitated with the right keywords and links.
What do you mean sponsored links? It'll be a sponsored reply, no outbound links required.
All our customers were in North America, but we let a semi-naughty bot from the UK scan us, and I will never understand why. It was still sending us malformed URLs we had purged from the site years ago. WTF.
> but that ideal only holds between consenting adults.
If your webserver serves up the page, you've already pre-consented.
One of my retirement plans has a monthly statement available as a pdf document. We're allowed to download that. But the bot I wrote to download it once a month was having trouble, they used some fancy bot detection library to cockblock it. Wasn't allowed to use Mechanize. Why? Who the fuck knows. I'm only allowed to have that statement if I can be bothered to spend 15 minutes a month remembering how to fucking find it on their site and downloading it manually, rather than just saving a copy. Banks are even worse... they won't keep a copy of your statements longer than 6 months, but go apeshit if you try to have those automatically downloaded.
I don't ask permission or play nice anymore. Your robots.txt is ignorable, so I ignore it. I do what I want, and you're the problem not me.
I'm confused why scraping is so resource intensive - it hits every URL your site serves? For an individual ecommerce site that's maybe 10,000 hits?
With 1000s of bots per month and 10,000 hits on an ecommerce site, with product images, that's a lot of data transfer, and a lot of compute if your site has badly designed or no caching, rendering all the same page components millions of times over. But...
Part of the problem is all those companies who use AWS "standard practice" services, who assume the cost of bandwidth is just what AWS charges, and compute-per-page is just what it is, and don't even optimise those (e.g. S3/EC2/Lambda instead of CloudFront).
I've just compared AWS egress charge against the best I can trivially get at Hetzner (small cloud VMs for bulk serving https cache).
You get an astonishing 392x(!) more HTTPS egress from Hetzner for the same price, or equivalently 392x cheaper for the same amount.
You can comfortably serve 100+ TB/month that way. With 10,000 pages times 1000 bots per month, that gives you 10MB per page, which is more than almost any eCommerce site uses, when you factor that bots (other than very badly coded bots) won't fetch the common resources (JS etc.) repeatedly for each page, only the unique elements (e.g. HTML and per-product images).
And the thousands of other bots also hitting those, together is far more than the legitimate traffic for many sites.
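For what it's worth, the arithmetic behind that 10 MB-per-page figure is easy to sanity-check with the numbers above:

```python
# Back-of-the-envelope check of the figures in the comment above.
pages = 10_000           # pages on the site
bots_per_month = 1_000   # crawlers doing a full pass each month
budget_tb = 100          # monthly egress you can afford on a cheap host

fetches = pages * bots_per_month                 # 10,000,000 page fetches
per_fetch_mb = budget_tb * 1_000_000 / fetches   # TB -> MB, spread per fetch
print(per_fetch_mb)                              # 10.0 MB per page fetch
```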
Yeah, there were times, even running a fairly busy site, that the bots would outnumber user traffic 10:1 or more, and the bots loved to endlessly trawl through things like archive indexes that could be computationally (db) expensive. At one point it got so bad that I got permission to just blackhole all of .cn and .ru, since of course none of those bots even thought of half obeying robots.txt. That literally cut CPU load on the database server by more than half.
In the last month, bot traffic has exploded to 10:1 due to LLM bots on my forum according to Cloudflare.
It would be one thing if it were driving more users to my forum. But human usage hasn't changed much, and the bots drop cache hit rate from 70% to 4% because they go so deep into old forum content.
I'd be curious to see a breakdown of what the bots are doing. On demand searches? General data scraping? I ended up blocking them with CF's Bot Blocker toggle, but I'd allow them if it were doing something beneficial for me.
For me (as I'm sure for plenty of other people as well) limiting traffic to actual users matters a lot because I'm using a free hosting tier for the time being. Bots could quickly exhaust it, and your website could be unavailable for the rest of the current "free billing" cycle, i.e. until your quota gets renewed.
You’ve forgotten the combinatorics of query params.
10,000 is Monday. Morning.
Yes. That’s a lot of bandwidth, depending on the content of course.
> But don’t just assume that every bot is malicious by default.
I'll bite. It seems like a poor strategy to trust by default.
I'll bite harder. That's how the public Internet works. If you don't trust clients at all, serve them a login page instead of content.
This is how it's going. Half the websites I go to have Cloudflare captchas guarding them at this point. Every time I visit StackOverflow I get a 5 second wait while Cloudflare decides I'm kosher.
Are you using TOR or a VPN, spoofing your User-Agent to something uncommon, or doing something else that tries to add extra privacy?
That kind of user experience is one that I've seen a lot on HN, and every time, without fail, it's because they're doing something that makes them look like a bot, and then being all Surprised Pikachu when they get treated like a bot by websites.
I started having similar experiences when I switched to the Brave browser, which blocks a lot of tracking. Many websites that never used to show me captchas now pop up Cloudflare protection layers on a regular basis.
In fairness this appears to be the direction we are headed anyway
It sucks that we're living in a landscape where bad actors take advantage of that way of doing things.
The really bad actors are going to ignore robots.txt entirely. You might as well be nice to the crawlers that respect robots.txt.
Even if you want to play nice, robots.txt is a catch-22, as accessing it is taken as a signal that you are a 'bot' by misconfigured anti-bot 'solutions'.
It sucks more that Cloudflare/similar have responded to this with "if your handshake fingerprints more like curl than like Chrome/Firefox, no access for you".
There are tools like curl-impersonate (https://github.com/lwthiker/curl-impersonate) out there that allow you to pretend to be any browser you like. It might take a bit of trial and error, but this mechanism could be bypassed with some persistence in identifying what it is that the resource is trying to block.
Or getting a CAPTCHA from Chrome when visiting a site you've been to dozens of times (Stack Overflow). Now I just skip that content, probably in my LLM already anyway.
Keep in mind that those LLMs are one of the bigger reasons why we see more and more anti bot behaviour on sites like SO.
That aggressive crawling to train those on everything is insane.
It's the same thing as the anti-piracy ads: you only annoy legitimate customers. This aggressive captcha campaign just makes Stack Overflow decline even faster than it normally would by lowering its quality.
I now write all of my bots in javascript and run them from the Chrome console with CORS turned off. It seems to defeat even Google's anti-bot stuff. Of course, I need to restart Chrome every few hours because of memory leaks, but it wasn't a fun 3 days the last time I got banned from their ecosystem with my kids asking why they couldn't watch Youtube.
Bad actors will always exploit whatever systems are available to them. Always have, always will.
Because if they play by the rules, they won't be bad actors
I guess back in 2003 people would expect an email to actually go somewhere, these days I would expect it to either go nowhere or just be part of a campaign to collect server admin emails for marketing/phishing purposes. Angry emails are always a bit much, but I wonder if they aren't sent as much anymore in general or if people just stopped posting them to point and laugh at and wonder what goes through people's minds to get so upset to send such emails.
My somewhat silly take on seeing a bunch of information like emails in a user agent string is that I don't want to know about your stupid bot. Just crawl my site with a normal user agent and if there's a problem I'll block you based on that problem. It's usually not a permanent block, and it's also usually setup with something like fail2ban so it's not usually an instant request drop. If you want to identify yourself as a bot, fine, but take a hint from googlebot and keep the user agent short with just your identifier and an optional short URL. Lots of bots respect this convention.
But I'm just now reminded of some "Palo Alto Networks" company that started dumping their garbage junk in my logs, they have the audacity to include messages in the user agent like "If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com" or "find out more about our scans in [link]". I put a rule in fail2ban to see if they'd take a hint (how about your dumb bot detects that it's blocked and stops/slows on its own accord?) but I forgot about it until now, seems they're still active. We'll see if they stop after being served nothing but zipbombs for a while before I just drop every request with that UA. It's not that I mind the scans, I'd just prefer to not even know they exist.
I think a better solution would be to block all the traffic, but have a comment in robots.txt with a way to be added onto a whitelist to scrape the contents of the resource. This puts the burden of requesting access on the owner of the bot, and if they really want that access, they can communicate it and we can work it out.
It's a nice option to have and maybe good in some cases. It reminds me of the nicety that some journalists do when requesting if they can use some video uploaded on social media for their show or piece. I do like the approach and shifting of first contact burden, as well as the general philosophical principle that blocking ought to be reversible and also temporary rather than permanent (though I also like the idea of exponential timeouts that can become effectively permanent). Still, I don't see myself ever doing anything like that. I'd still prefer to just not know about the bot at all, and if I did decide to perma-block them, unless the first contact comes with sufficient dollar signs attached I'm likely to ignore it entirely. I'm not usually in the mood for starting random negotiation with anybody.
I also tend to see the web from the "open web" dream perspective. By default no traffic is blocked. The burden of requesting is already inherently done with a client -- they request a route, and I serve it or not. For things like my blog I don't tend to care who is requesting a particular route -- even admin pages can be requested, they just don't get anything without being logged in. If someone is being "cute" requesting non-existent WordPress pages or what have you, searching for vulnerabilities, or has an annoying/ugly user agent string, or is just pounding me for no real reason, then I do start to care. (The "pounding" aspect is a bit trickier -- I look at steady state. Another comment mentioned cutting their db server's CPU load in half by dropping unlikely-to-be-real-users from two countries. For me, if that is merely a steady-state reduction from like 10% of a machine to 5%, I don't really care; I start caring when it would get in the way of real growth without having to use more resources.)
When I was hosting on EC2, I used to have very mild anxiety that I'd piss off someone and they'd try to "harm" me by launching a botnet of requests at large media files and rack up bandwidth costs. (I believe it when some people say this has happened more organically with normal bots in the age of LLMs, but my concern was more targeted botnets/DDoS.) There are a few ways to mitigate that anxiety: 1) set up monitoring, alerts, and triggers directly in code running on the instance itself or via overseeing AWS tools (I did the latter, which is less reliable, but still; there was a threshold to shut down the whole instance, minimizing the total damage possible to something like under a couple hundred bucks, though I forget the details of trying to calculate how much traffic could theoretically be served before the monitoring side noticed), 2) hide behind Cloudflare and their unlimited bandwidth, as my content was mostly static (I didn't do that), 3) move/rearchitect to a free host like GitHub Pages and give up hosting my own comments (again didn't do), 4) move to OVH, which has unlimited bandwidth (did this when Amazon wanted to start charging an absurd amount for just a single IPv4 address).
I can see how it could lead to more overhead when communicating with the requesters. That could be a lot in the event that lots of them might want to crawl your resource.
I can see the argument that if I want to hide something, I should put it behind a layer of authentication. Robots.txt is not a substitute for proper access control mechanisms. It is more of a "if they do honor this document, this would reduce the unnecessary traffic to my site" notion.
I appreciate you highlighting your personal experience in dealing with bots! I like the ideas of monitoring and being behind something like Cloudflare tools which would protect against the major influx of traffic. I think this is especially important for smaller sites which either use low or free tiers of cloud services.
It's just that people are suspicious of unknown crawlers, and rightly so.
Since it is impossible to know a priori which crawlers are malicious, and many are, it is reasonable to default to considering anything unknown malicious.
The problem with robots.txt is its reliance on the identity rather than the purpose of the bots.
The author had blocked all bots because they wanted to get rid of AI scrapers. Then they wanted to unblock bots scraping for OpenGraph embeds, so they unblocked... LinkedIn specifically. What if I post a link to their post on Twitter or any of the many Mastodon instances? Now they'd have to manually unblock every one of those UAs, which they obviously won't, so this creates an even bigger power advantage for the big companies.
What we need is an ability to block "AI training" but allow "search indexing, opengraph, archival".
And of course, we'd need a legal framework to actually enforce this, but that's an entirely different can of worms.
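The closest approximation today is identity-based groups that you hope line up with purpose. A rough sketch is below; the agent names are the publicly documented ones, compliance is entirely voluntary, and a list like this goes stale quickly:

    # Crawlers whose stated purpose is AI training
    User-agent: GPTBot
    User-agent: Google-Extended
    User-agent: CCBot
    Disallow: /

    # Search indexing and link previews
    User-agent: Googlebot
    User-agent: Bingbot
    User-agent: LinkedInBot
    User-agent: facebookexternalhit
    User-agent: Twitterbot
    Disallow: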
I think there is a long standing question about what robots.txt is for in general. In my opinion it was originally (and still is mostly) intended for crawlers. It is for bots that are discovering links and following them. A search engine would be an obvious example of a crawler. These are links that even if discovered shouldn't be crawled.
On the other end is user-requested URLs. Obviously a browser operated by a human shouldn't consider robots.txt. Almost as obviously, a tool subscribing to a specific iCal calendar feed shouldn't follow robots.txt because the human told it to access that URL. (I recall some service, can't remember if it was Google Calendar or Yahoo Pipes or something else that wouldn't let you subscribe to calendars blocked by robots.txt which seemed very wrong.)
The URL preview use case is somewhat murky. If the user is posting a single link and expecting it to generate a preview this very much isn't crawling. It is just fetching based on a specific user request. However if the user posts a long message with multiple links this is approaching crawling that message for links to discover. Overall I think this "URL preview on social media" probably shouldn't follow robots.txt but it isn't clear to me.
Something like Common Crawl can be used for both search and AI training (and any other purpose, decided after the crawl is done).
Then such a crawler should mark itself with all purpose tags and thus be blocked in this scenario.
Alternatively, it could make the request anyways and separate the crawled sites by permitted purpose in its output.
This is really a problem of sharing information in-band instead of out-of-band. The OpenGraph metadata sits in-band with the page content, which doesn't need to be shared with OpenGraph bots at all. The way to separate the usage is to separate the content and metadata with some specific query, using `content-type` or `HEAD` or something; then bots are free to fetch the metadata (useless for AI bots) and you can freely forbid all bots from the actual content. Then you don't really need much of a legal framework.
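A rough Python sketch of that separation, assuming a hypothetical "?preview=1" query parameter (nothing standard; the point is only that preview bots could be pointed at a metadata-only response while the full article stays off limits to them):

    # Sketch only: serve OpenGraph metadata on a hypothetical "?preview=1"
    # request, and the full article otherwise.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    OG_STUB = (
        '<!doctype html><html><head>'
        '<meta property="og:title" content="Example post">'
        '<meta property="og:type" content="article">'
        '<meta property="og:url" content="https://example.com/post">'
        '<meta property="og:image" content="https://example.com/cover.png">'
        '</head><body></body></html>'
    )
    FULL_PAGE = '<!doctype html><html><body><article>Full content.</article></body></html>'

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            parsed = urlparse(self.path)
            preview = parse_qs(parsed.query).get("preview") == ["1"]
            body = (OG_STUB if preview else FULL_PAGE).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()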
I like the idea of using HEAD or OPTIONS methods and having all bots access that, so that they get a high-level idea of what's going on without access to the actual content, if the owner decides to block it.
I do like your suggestion of creating some standard that categorizes by function or purpose like you mention. This could simplify things, provided there is a way to validate the purpose and make spoofing hard to achieve. And yes - there is also the legal side.
I do think that I will likely need to go back and unblock a couple of other bots for this exact reason - so that it would be possible to share it and have previews in other social media. I like to take a slow and thoughtful approach to allowing this traffic as I get to learn what it is that I want and do not want.
Comments here have been a great resource to learn more about this issue and see what other people value.
I try to stay away from negative takes here, so I’ll keep this as constructive as I can:
It’s surprising to see the author frame what seems like a basic consequence of their actions as some kind of profound realization. I get that personal growth stories can be valuable, but this one reads more like a confession of obliviousness than a reflection with insight.
And then they posted it here for attention.
It's mostly that they didn't think of the page preview fetcher as a "crawler", and did not intend for their robots.txt to block it. It may not be profound, but it's at least not a completely trivial realisation. And heck, an actual human-written blog post can arguably improve the average quality of the web.
The bots are called "crawlers" and "spiders", which to me evokes the image of tiny little things moving rapidly and mechanically from one place to another, leaving no niche unexplored. Spiders exploring a vast web.
Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.
It'd be like telling someone "I spent part of the last year travelling." and when they ask you where you went, you tell them you commuted to-and-fro your workplace five times a week. That's technically travelling, although the other person would naturally expect you to talk about a vacation or a work trip or something to that effect.
> Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.
It’s definitely not crawling as robots.txt defines the term:
> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.
— https://www.robotstxt.org/orig.html
You will see that reflected in lots of software that respects robots.txt. For instance, if you fetch a URL with wget, then it won’t look at robots.txt. But if you mirror a site with wget, then it will fetch the initial URL, then it will find the links in that page, then before fetching subsequent pages it will fetch and check robots.txt.
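You can see the difference on the command line (example.com as a placeholder):

    # single user-requested URL: robots.txt is not consulted
    wget https://example.com/page.html

    # recursive mirror: robots.txt is fetched and honoured
    wget --mirror https://example.com/

    # and the check can be turned off explicitly
    wget -e robots=off --mirror https://example.com/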
That's exactly it. It was one of those unintended consequences of blocking everything that led me down the road of figuring it out.
Like other commenters have indicated, I will likely need to go back and allow some other social media to access the OPG data for previews to render properly. But since I mostly post on LinkedIn and HN, I don't feel the need to go and allow all of the other media at the moment. That might change in the future.
They posted it here because they wouldn't appear on Google otherwise (:
I mean it was a realization for me, although I wouldn't call it profound. To your point, it was closer to obliviousness, which led me to learn more about Open Graph Protocol details and how Robots Exclusion Protocol works.
I try to write about things that I learn or find interesting. Sharing it here in the hopes that others might find it interesting too.
I agree, and I am also confused about how this got on the front page of all things. It's like reading a news article saying 'water is wet'.
You block things -> of course good actors will respect that and avoid you -> of course bad actors will just ignore it, because it's a polite "please do not do this", not a firewall blocking things.
Honestly, I am also surprised this got on the front page. This was supposed to be a small post about what I learnt in the process of fixing my LinkedIn previews. I don't know how we got here.
Another common unintended consequence I've seen is conflating crawling and indexing with regards to robots.txt.
If you make a new page and never want it to enter the Google search index, adding it to robots.txt is fine, Google will never crawl it and it will never enter the index.
If you have an existing page that is currently indexed and want to remove it, adding that page to robots.txt is a BAD idea though. In the short term Google will continue to show the page in search results, but with no metadata (because it can't crawl it anymore). Even worse, Google won't pick up any noindex tags on the page, because robots.txt is blocking the page from being crawled!
Eventually Google will get the hint and remove the page from the index, but it can be a very frustrating time waiting for that to happen.
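For reference, the de-indexing signal has to stay reachable: leave the page crawlable and serve a noindex instead, either in the markup or as an HTTP response header:

    <meta name="robots" content="noindex">

    X-Robots-Tag: noindex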
There are cases where Google might find a URL blocked in robots.txt (through external or internal links), and the page can still be indexed and show up in the search results, even if they can't crawl it. [1].
The only way to be sure that it will stay out of the results is to use a noindex tag. Which, as you mentioned, search engine bots need to "read" in the code. If the URL is blocked, the "noindex" cannot be read.
[1] https://developers.google.com/search/docs/crawling-indexing/... (refer to the red "Warning" section)
It is an interesting tidbit. I personally don't need Google to remove it from indexing; it is more of a "I don't care if they index it". I mostly care about the scraping, not the indexing. I do understand that these terms could be used interchangeably, and in the past I might have conflated them.
Hey OP,
1)
You consider this about the LinkedIn site but don't stop to think about other social networks. This is true of basically all of them. You may not post on Facebook, Bluesky, etc., but other people may like your links and post them there.
I recently ran into this as it turns out the Facebook entries in https://github.com/ai-robots-txt/ai.robots.txt also block the crawler FB uses for link previews.
2)
From your first post,
> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.
Kinda... it's technically true that you can rank in Google if you block them in robots.txt, but it's going to take a lot more work. Also, your listing will look worse (last time I saw this there was no site description, but that was a few years back). If you care about Google SEO traffic, you may want to let them onto your site.
Hey, @jarofgreen! Thank you for the feedback!
1) I only considered LinkedIn since I have been posting there and here on HN, and that's it. I figured I would let it play out until I need to allow more bots access. Your point about other people wanting to share links to the blog is a very valid one that I hadn't thought about. I might need to allow several other platforms.
2) With Google and other search engines I have seen a trend towards the AI summaries. It seems like this is the new meta for search engines. And with that I believe it will reduce the organic traffic from those engines to the websites. So, I do not particularly feel that this is a loss for me.
I might eat my words in the future, but right now I think that social media and HN sharing is what will drive the most meaningful traffic to my blog. It is word-of-mouth marketing, which I think is a lot more powerful than finding my blog in a Google search.
I will definitely need to go back and do some more research on this topic to make sure that I'm not shooting myself in the foot with these decisions. Comments here have been very helpful in considering other opinions and options.
Point 2 is true to an extent. Assuming you aren't monetizing your traffic, though, wouldn't earning citations still be more valuable than not showing up in Google at all?
You should also consider that a large proportion of search is purely navigational. What if someone is trying to find your blog and is typing 'evgenii pendragon'. AI summaries don't show for this kind of search.
Currently I can see your site is still indexed (disallowing Google won't remove a page from the index if it's already been indexed) and so you show in search results, but because you block Google in robots.txt it can't crawl content and so shows an awkward 'No information is available for this page.' message. If you continue to block Google in robots.txt eventually the links to your site will disappear entirely.
Even if you really don't want to let AI summarize your content, I would at least allow crawlers to your homepage.
That is what I first thought of when I started seeing more comments noting that it might be a good idea to allow search engine traffic. All of these suggestions led me to think about it some more. I definitely will need to understand the domain of SEO and search engine indexing a little deeper if I want to make an educated decision and not shoot myself in the foot in the long run.
I appreciate your suggestion!
I think we are seeing the death of what was left of the open web, as people react to inconsiderate crawling for uses (AI) they are not sympathetic with by deciding that trying to ban all automated access is the way to go. :(
The result will be that giant corporations and those with bad intent will still find a way to access what they need, but small, hobby, citizen, and civil society efforts will be blocked out.
I believe that this might give birth to new tools, protocols, and tech which will enable the next evolution of the Open Web into something akin to Protected Open Web.
I very much dislike the invasive scraping approaches. If something is to be done about it, it will result in a new way for clients to interact with resources on the web.
This article could have been two lines. It takes some serious stretching of school-essay-writing muscles to inflate it to this many pages of waffle.
I think a paragraph could have been enough to describe the issue.
My goal with this post was to describe my personal experience with the problem, research, and the solution - the overall journey. I also wanted to write it in a manner that a non-technical person would be able to follow. Hence, being more explicit in my elaborations.
Weirdly, this is something that Apple actually gets right - the little "previews" you get when sharing links in iMessage get generated client-side; _by the sender_.
There are good reasons why you’d not want to rely on clients providing this information when posting to LinkedIn (scams, phishing, etc); but it’s interesting to see an entirely different approach to the problem used here.
I came here to write that I expected that clients should generate previews after receiving a link inside a message. I also expected that somebody else would have already pointed that out and here we are.
However I also understand that there are a number of reasons for a server to scrape the link. In no particular order:
1. scraping all the things, they might be useful sometimes in the future.
2. the page changes, goes 404, the client is reset and loses its db and can't rebuild the preview, but the client can rely on the server for that
3. it's faster for the client as the preview comes with the message and it does not have to issue some extra calls and parse a possibly slow page.
Anyway, you write that it's the sender that generates the preview in iMessage, so that leaves point #1 and possibly the part of #2 about flaky internet connections: the server is in a better position to keep trying to generate the preview.
A curious detail I noticed: the author's account is a bit older than a month, yet he managed to publish four of his own articles (and nothing more, obviously). All of them (three) have zero comments, and two of those three are dead.
Yet this post of his (posting his own work) gained traction, I believe for the robots.txt topic rather than the article itself.
That shows that even if you ignore all the rules of keeping a healthy community (not publishing only self-promotion), eventually you'd get traction and nobody would care, I guess.
Quick edit: my bad, a wrong click brought me to the wrong location, so I made a somewhat wrong assumption. The author posted 4 other posts alongside his own, so it's not 100% self-promotion, but 50%.
To be honest, I also think that the discussion has become more interesting than what I wrote in the article. I have learned quite a few things for myself. And on top of that it has been great to hear some feedback in regards to some thoughts that I wrote.
There are more articles available in my blog than I have shared here. I don't think that everything that I write is shareworthy on HN. There are some that I find to be more interesting. Those are the ones I end up sharing.
Like you have noticed, I try to share other interesting resources that I find online too. Is there a ratio of self/other content that would help keep a healthy community?
I don’t consider HN a healthy community, so I believe this approach of self-promotion is the only sane way to actually deal with this website. So, no real judgement from me. I just find it curious that apparently some mod banned you, but this time it was too late, since the discussion became somewhat interesting.
If it was some other website, I’d say that if you post many of your own posts yourself many community members won’t be happy. I’d say go with not 50% but, say, at least 20% or even 10%. Again, my personal opinion of this website, it’s long beyond repair (if it ever was), so feel free to do whatever you want, unless some whims of mods ban you for no real reason one day.
Yeah, I don't even think I 100% understand how the algorithm and moderation works here.
I will consider bringing in more content other than my own. I did read the HN guidelines, and I saw that they encourage members to share what they find and, occasionally, their own stuff. Appreciate your opinion!
The problem isn't the robots that do follow robots.txt, it's all the bots that don't. Robots.txt is largely irrelevant now; compliant bots don't represent most of the traffic problem. They certainly don't represent the bots that are going to hammer your site without any regard, and those bots don't follow robots.txt.
That's what honeypots are for.
Deny /honeypot in your robots.txt
Add <a href="/honeypot" style="display:none" aria-hidden="true">ban me</a> to your index.html
If an IP accesses that path, ban it.
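A minimal sketch of that last step in Python, assuming a combined-format access log at a hypothetical path; the actual ban (iptables, ipset, fail2ban, whatever you use) is left out:

    import re

    LOG_PATH = "/var/log/nginx/access.log"  # assumption: adjust to your server
    HONEYPOT = "/honeypot"

    # combined log format starts with: <ip> <ident> <user> [<time>] "<method> <path> ..."
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)')

    def honeypot_ips(log_path):
        ips = set()
        with open(log_path, encoding="utf-8", errors="replace") as f:
            for line in f:
                m = LINE_RE.match(line)
                if m and m.group(2).split("?")[0] == HONEYPOT:
                    ips.add(m.group(1))
        return ips

    if __name__ == "__main__":
        for ip in sorted(honeypot_ips(LOG_PATH)):
            print(ip)  # pipe into your firewall/ban tooling of choice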
There is an interesting article about AI tarpits that addresses a similar issue: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...
The argument is basically to have the bots that decide to ignore your robots.txt (or any bot, if you desire) scrape your website indefinitely, wasting their resources.
While this may work today, more and more bots now use full headless rendering, or at least apply CSS rules, so that they avoid fetching invisible content.
I wonder whether the path appearing in robots.txt alone (or maybe a <link> tag with a bogus rel attribute) would already be enough to make evil bots follow it. That would at least avoid accidental follows due to CSS/ARIA not working properly in odd setups.
> <a href="/honeypot" style="display:none" aria-hidden="true">ban me</a>
Unrelated meta question, but is the aria attribute necessary, since display: none; should already remove the content from the flow?
I would hope that all screen readers would respect display:none. The aria-hidden is for CYA, banning even one blind user would be quite bad optics (as is this sentence now that I think about it).
I like this. Adding now. Thanks!
Not sure why you were downvoted. I have zero confidence that OpenAI, Anthropic, and the rest respect robots.txt however much they insist they do. It's also clear that they're laundering their traffic through residential ISP IP addresses to make detection harder. There are plenty of third-parties advertising the service, and farming it out affords the AI companies some degree of plausible deniability.
One way to test whether the model respects robots.txt could be to ask the model if it can scrape a given URL. One thing that doesn't address, though, is the scraping of the training data for the model. That area feels more like a wild west.
Nobody has any confidence in AI companies not to DDoS. That's why there have been dozens of posts about how bandwidth is becoming an issue for many websites as bots continuously crawl their sites for new info.
Wikipedia publishes full database dumps for exactly this reason. AI companies ignore the readily available dumps and instead crawl every page hundreds of times a day.
I think debian also recently spoke up about it.
The "full solution" to this, of course, is micropayments. A bot which has to pay a tenth of a cent to you every time it visits one of your pages or something else the page 404s will quickly rack up a $10 bill crawling a whole 10,000 page site. If it tries to do that every day, or every hour, that's an excellent payday for you and a very compelling reason for almost all bots to blacklist your domain name.
A human being who stops by to spend 20 minutes reading your blog once won't even notice they've spent 1.2 cents leafing through. This technology has existed for a while, and yet very few people have found it worth adopting. There is probably a good reason for that.
The realistic solution is to probably just do some statistics and figure out who's getting on your nerves, and then ban them from your digital abode. Annoying, but people go a lot farther to protect their actual homes if they happen to live in high crime areas.
Isn't this effectively what a tool like Anubis achieves?
What astounds me is there are no readily available libraries crawler authors can reach for to parse robots.txt and meta robots tags, to decide what is allowed, and to work through the arcane and poorly documented priorities between the two robots lists, including what to do when they disagree, which they often do.
Yes, there's an ancient Google reference parser in C++11 (which is undoubtedly handy for that one guy who is writing crawlers in C++), but not a lot for the much more prevalent Python and JavaScript crawler writers who just want to check if a path is OK or not.
Even if bot writers WANT to be good, it's much harder than it should be, particularly when lots of the robots info isn't even in the robots.txt files, it's in the index.html meta tags.
robots.txt support is built into the Python stdlib as urllib.robotparser: https://docs.python.org/3/library/urllib.robotparser.html
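Minimal usage of that stdlib parser, with example.com and the user agent string as placeholders:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    ua = "MyCrawler/1.0"
    print(rp.can_fetch(ua, "https://example.com/some/path"))  # True/False per the rules
    print(rp.crawl_delay(ua))  # None if no Crawl-delay directive applies

It only covers robots.txt, though, not the meta robots tags the parent comment mentions.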
rel=nofollow is a bad name. It doesn’t actually forbid following the link and doesn’t serve the same purpose as robots.txt.
The problem it was trying to solve was that spammers would add links to their site anywhere that they could, and this would be treated by Google as the page the links were on endorsing the page they linked to as relevant content. rel=nofollow basically means “we do not endorse this link”. The specification makes this more clear:
> By adding rel="nofollow" to a hyperlink, a page indicates that the destination of that hyperlink should not be afforded any additional weight or ranking by user agents which perform link analysis upon web pages (e.g. search engines).
> nofollow is a bad name […] does not mean the same as robots exclusion standards
— https://microformats.org/wiki/rel-nofollow
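So a page that doesn't want to endorse a user-submitted link simply emits something like:

    <a href="https://example.com/" rel="nofollow">user-submitted link</a>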
Thanks for this!
I don't see a reason why a good bot operator couldn't build a parser lib in a different language and put it on a public repo.
Shouldn't be that hard if someone WANTS to be good.
Sure, but it's always easier to use a tool that's been tried and tested.
The "good" bot writers rarely have enough resources to demolish servers blindly, and are generally more careful whether or not you make it easier, so there's not much incentive.
I really think that most people should not use robots.txt
If you don't want people to crawl your content, don't put it online.
There are so many consequences of disallowing robots -- what about the Internet Archive for example?
> If you don't want people to crawl your content, don't put it online.
I sometimes put things online for specific people to view/use, or for my own purposes. That gets an “all crawlers can do one” robots.txt and sometimes a script or two to try to waste a little of the time of those that don't play ball.
It is online because I want it online, not for some random to hoover up.
I consider robots.txt as a garden gate. I know it isn't secure, but likewise someone peering directly into my back bedroom window knows just as well that I don't want them there.
I could put stuff like that behind authentication, but that is hassle for both me and the people I want to access the stuff. I usually use not-randomly-guessable URIs, though sometimes that is inconvenient too, and anyway they do sometimes get “guessed”. I must have at least one friend-of-friend with an active infestation that is reading their messages or browser history for things to probe, because the traffic pattern isn't just preview generation; I've had known AI crawlers pass by some things repeatedly.
TBH I don't really care that much, much at all in fact, I just don't like the presumption that my stuff is free for commercial exploitation.
Nah, there's a middle way to that.
I want to post online, but I don't want random asshole drive-bys involved.
robots.txt seems to be an irresistible attractor for some, most recently in a crusade against all kinds of GenAI.
I get not wanting to have our data serve as training data, but I've also seen moderately large newspapers throwing literally all LLM bots in there, i.e. not only those that scrape training data, but also those that process users' search requests or even direct article fetches.
The obvious, but possibly not expected, result was that this newspaper became effectively invisible to user searches in ChatGPT. Not sure if I'm an outlier here, but I personally click through to many of the sources ChatGPT provides, so they must be losing tons of traffic that way.
Having worked on bot detection in the past: some really simple, old-fashioned attacks happened by doing the opposite of what the robots.txt file says.
While I doubt it does much today, that file really only matters to those that want to play by the rules, which, on the free web, is not an awful lot of the web anymore, I'm afraid.
That was the first thing that I learnt about the robots.txt file. Even RFC 9309, the Robots Exclusion Protocol document (https://www.rfc-editor.org/rfc/rfc9309.html), mentions:
> These rules are not a form of access authorization.
Meaning that these are not enforced in any way. They cannot prevent you from accessing anything really.
I think the only approach that could work in this scenario would be to find out which companies disregard robots.txt and bring it to the attention of the technical community. Practices like these could make a company look shady and untrustworthy if found out. That could be one way to keep them accountable, even though there is still no guarantee they will abide by it.
This is a kind of scream test, even if self-inflicted. Scream tests are usually a good way to discover actual usage in complex (or not so complex) systems.
It definitely was (and still is to a degree) self-inflicted :)
It is fun to learn something when you discover an unintended consequence and then work backwards from it.
What a way to pump up one's own online presence with next-to-nothing actions.
I wish there were way fewer posts like this.
> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.
As far as I remember from google search console, a disallow directive in robots.txt causes google not only to avoid crawling the page, but also to eventually remove the page from its index. It certainly shouldn't add any more pages to its index, external references or not.
Thank you for that note! I didn't know that. It is something I will need to figure out: either I'm ok with not being in the search engine OR I will update the robots.txt. Currently I'm relying on the social media traffic and the word-of-mouth marketing.
This reminds me of an old friend of mine who wrote long revelation posts on how he started using the "private" keyword in C++ after the compiler helped him find why a class member changed unexpectedly, and how he no longer drives a car with the clutch half-pressed because it burns the clutch.
If you are hosting a house party that invites the entire world, robots.txt is a neon sign to guide guests to where the beers are, who's cooking what kind of burgers and on what grill, the rules of the house, etc. You'll still have to secure your gold chains and laptop in a safe somewhere, or decide whether to even keep them in the same house at all.
Gold chains etc. should be behind authentication. robots.txt is more like a warning sign that says the hedge maze in the garden goes on forever, so probably stay out of it.
A brilliant analogy. robots.txt doesn't provide access control; authentication and authorization do.
This doesn't seem like a new discovery at all - this is what news publications have been dealing with ever since they went online.
You aren't going to get advertising without also providing value - be that money or information. Google has over 2 trillion in capitalization based primarily on the idea of charging people to get additional exposure, beyond what the information on their site otherwise would get.
I believe that as search engines continue to move toward AI summaries and responses, traffic to websites will drop, since most people will be OK accepting the answers that the AI gives them.
My approach right now is to rely on social media traffic primarily where you can engage with the readers and build trust with the audience. I don't plan on using any advertising in the near future. While that might change, I am convinced that more intentional referral traffic will generate more intentional engagement.
LinkedIn is by far the worst offender in post previews. The doctype tag must be all lowercase. The HTML document must be well-formed (the meta tags must be in an explicit <head> block, for example). You must have OG meta tags for url, title, type, and image. The url meta tag gets visited, even if it's the same address the inspector is already looking at.
Fortunately, the post inspector helps you suss out what's missing in some cases, but c'mon, man, how much effort should I spend helping a social media site figure out how to render a preview? Once you get it right, and to quote my 13 year old: "We have arrived, father... but at what cost?"
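For reference, the minimal head that satisfies that list looks roughly like this (URLs and titles are placeholders, and LinkedIn's exact requirements can change):

    <!doctype html>
    <html>
      <head>
        <meta property="og:url" content="https://example.com/post">
        <meta property="og:title" content="Post title">
        <meta property="og:type" content="article">
        <meta property="og:image" content="https://example.com/cover.png">
      </head>
      <body>…</body>
    </html>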
I had to discover the Post Inspector tool which was very helpful as it provided error messages.
I didn't know that about the doctype tags. I must have had them right from the beginning, so I didn't encounter those issues. It's good to know though.
“Oh no my linked-in posts aren’t being put in front of enough people”
It is amazing what people think is important these days.
This is a mistake that many websites make: they try to block all robots, and then the robots that serve their blog posts to users can't function anymore.
Agree. If you don't want it out there, put it in your journal or require a login.
Not every web site is a blog. Not every web site can be legally put behind a login.
What kind of information legally cannot be put behind a login?
Worst offenders I come across: official government information that needs to be public, placed behind Cloudflare, preventing even their M2M feeds (RSS, Atom, ...) from being accessed.
Is cloudflare providing some sort of login screen? How is that possible? Surely, you're not confusing a CAPTCHA with a login?
No logins, but browser fingerprinting, behavior tracking, straight blocking, captchas and other 'verification', even creating bogus honeypots and serving up unrelated generated spam instead of real page content.
None of that prevents users from getting to the data, though, so I'm wondering what your point is.
Maybe he is talking about stuff you're required by law to disclose but you don't really want to be seen too much. Like code of conduct, terms of service, retractions or public apologies.
Yes, there's often not much reason to block bots that abide by the rules. It just makes your site not show up on other search indexes and introduces problems for users. Malicious bots won't care about your robots.txt anyway.
Most bots don't serve the blog post to users.
Isn't LinkedIn dead?
I feel it is morphing into Twitter/Facebook/Instagram more each day.
It used to be this ultrafake eternal job interview site, but people now seem uninhibited to go on wild political rants even there.
One can dream.
It can still be more dead
You shouldn't worry about LinkedIn, the cancer of the internet.