How real is this "crawler plague" that the author refers to? I haven't seen it. But that's just as likely to be because I don't care, and therefore am not looking, as it is to be because it's not there. Loading static pages from a CDN to scrape training data takes such a minimal amount of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
Yes, it’s true. Most sites don’t have a forever cache TTL so a crawler that hits every page on a database-backed site is going to hit mostly uncached pages (and therefore the DB).
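To make the caching point concrete, here is a rough simulation, not a measurement of any real site. It assumes a simple fixed-TTL cache, a made-up site of 10,000 pages, a 300-second TTL, and "human" traffic that follows a Zipf-like popularity curve; all of those numbers and names are hypothetical. The crawler requests every page exactly once, so nothing it asks for is ever in the cache.

```python
import random

def hit_rate(requests, ttl):
    """Fraction of requests served from a fixed-TTL cache.

    requests: iterable of (timestamp_seconds, url), in time order.
    """
    expires = {}  # url -> time at which the cached copy expires
    hits = total = 0
    for now, url in requests:
        total += 1
        if expires.get(url, -1.0) > now:
            hits += 1  # still fresh: served from cache
        else:
            expires[url] = now + ttl  # miss: origin (and DB) renders it, cache stores it
    return hits / total if total else 0.0

# Hypothetical numbers, chosen only for illustration.
N_PAGES = 10_000
TTL = 300.0   # seconds
GAP = 0.36    # seconds between requests (about one hour of traffic in total)
pages = [f"/page/{i}" for i in range(N_PAGES)]

# Crawler: walks every page exactly once.
crawler = [(i * GAP, url) for i, url in enumerate(pages)]

# Humans: same request count, but concentrated on popular pages (Zipf-like weights).
rng = random.Random(0)
weights = [1 / (rank + 1) for rank in range(N_PAGES)]
visited = rng.choices(pages, weights=weights, k=N_PAGES)
humans = [(i * GAP, url) for i, url in enumerate(visited)]

crawler_rate = hit_rate(crawler, TTL)
human_rate = hit_rate(humans, TTL)
print(f"crawler hit rate: {crawler_rate:.2%}")
print(f"human hit rate:   {human_rate:.2%}")
```

Under these assumptions every crawler request is a miss, while a good share of the human requests land on pages that are already cached. The request volume is identical in both cases; only the access pattern differs.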
I also have a faceted search that some stupid crawler has spent the last month iterating through. Also mostly uncached URLs.
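For a sense of why a crawler can spend a month on faceted search, here is a back-of-the-envelope count. The facet names and sizes are invented for illustration, and it assumes each facet is either unset or set to exactly one value, with no multi-select, sorting, or pagination parameters (which would only make the number bigger).

```python
from math import prod

def faceted_url_count(facets: dict[str, int]) -> int:
    """Distinct filter combinations when each facet is unset or set to one value."""
    return prod(n + 1 for n in facets.values())

# Hypothetical facets: name -> number of selectable values.
facets = {"brand": 40, "color": 12, "size": 8, "price_band": 6, "rating": 5}

total = faceted_url_count(facets)
print(total)  # 201474
```

Five modest facets already give about 200,000 distinct URLs, nearly all of which no human will ever request, so each one is a cache miss.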
It's not that website owners don't care that they're frustrating users, losing visitors and customers, or creating a poor experience. For most website owners it's an intractable problem: they have to combat the endless ways their sites are being botted and bogged down, while also paying for the resources to handle the 98% of their traffic that isn't coming from real users and customers. By all means, solve it and everyone will be happy.