Forum is being scraped again

Nov 17, 2019
13,229
7,851
136
There's a MASSIVE gulf of difference between 5K-10K real users just browsing the site and consuming content vs. 350,000+ bots scraping your site, following every possible link, downloading every possible image, cataloguing whatever it possibly can.
Archive anything over a year old and block all outside access to it. Maybe only make it available to members meeting certain criteria.


They can't scrape what they can't find.
 
  • Like
Reactions: igor_kavinski

CParsons

Staff member
Dec 4, 2019
39
61
91
Archive anything over a year old and block all outside access to it. Maybe only make it available to members meeting certain criteria.


They can't scrape what they can't find.
Lots of avenues to explore for sure, but important to remember that all sites these days are at the whim of the almighty Google. Praise Google. Feed Google. All hail Google. 🤣
 
  • Haha
Reactions: 511

Red Squirrel

No Lifer
May 24, 2003
70,151
13,565
126
www.anyf.ca
Wow guess these bots are really hammering the server hard then. The amount they hammer must scale with size of site.

Mine gets hammered randomly, but the highest it got was 1,964 at once, which is peanuts.
 
  • Like
Reactions: CParsons

DrMrLordX

Lifer
Apr 27, 2000
22,695
12,643
136
If it's AI scrapers doing the damage, they could potentially follow us to other forums. Which would be troubling. Though they might not if the new forum lacked the depth of content that's present here.
 

Red Squirrel

No Lifer
May 24, 2003
70,151
13,565
126
www.anyf.ca
If they're simply hitting the forum too fast, I wonder if we could just set up rate limiting. If an IP is hitting the server more than X times per second, it gets temporarily blocked. Not sure how easy that would be to set up; I guess it could be done with a script that looks at the Apache log.
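A rough sketch of what such a log-watching script might look like, in Python. It assumes the Apache access log uses the Common or Combined Log Format, that entries are in chronological order, and that month names are in English. The function name `find_heavy_hitters` and the threshold values are made up for illustration, and the actual blocking step (a firewall rule, for example) is left out.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Client IP and timestamp at the start of a Common/Combined Log Format line (assumed format).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def find_heavy_hitters(lines, max_requests=20, window_seconds=1):
    """Return the set of IPs with more than max_requests hits inside any window_seconds span.

    Hypothetical helper: assumes lines are in chronological order.
    """
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # per-IP timestamps still inside the sliding window
    offenders = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, TIME_FORMAT)
        hits = recent[ip]
        hits.append(when)
        # Drop timestamps that have fallen out of the window.
        while when - hits[0] >= window:
            hits.popleft()
        if len(hits) > max_requests:
            offenders.add(ip)
    return offenders


if __name__ == "__main__":
    sample = ['203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512'] * 5
    sample.append('203.0.113.9 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512')
    print(find_heavy_hitters(sample, max_requests=3))
```

Because Apache timestamps only have one-second resolution by default, a one-second window here effectively counts hits that share the same logged second. Existing tools such as fail2ban work on a similar log-scanning idea and also handle the temporary block, so they may be less work than a custom script.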
 
Jul 27, 2020
25,998
17,945
146
F*** Google
Necessary evil. Can you imagine how bad internet search would be without feeding that unholy beast? They exist because they fulfill a purpose that no one, even Microsoft, has been able to challenge with their billions. When's the last time you actually got a useful result from Bing? No matter how bad they are, their stuff works and helps hundreds of millions, if not billions, of people daily. I don't get a single cent from Google (got no sense to understand how AdSense works) but if they suddenly vanish into thin air, the ensuing chaos could take years to disperse and the company or companies filling the void may never approach the quality they provide.

And don't even dare to imagine a world where Apple is the only quality mobile platform without the existence of Android. You F Google, you F everybody.
 

511

Platinum Member
Jul 12, 2024
2,871
2,873
106
Mhhh maybe we can
Necessary evil. Can you imagine how bad internet search would be without feeding that unholy beast? They exist because they fulfill a purpose that no one, even Microsoft, has been able to challenge with their billions. When's the last time you actually got a useful result from Bing? No matter how bad they are, their stuff works and helps hundreds of millions, if not billions, of people daily. I don't get a single cent from Google (got no sense to understand how AdSense works) but if they suddenly vanish into thin air, the ensuing chaos could take years to disperse and the company or companies filling the void may never approach the quality they provide.

And don't even dare to imagine a world where Apple is the only quality mobile platform without the existence of Android. You F Google, you F everybody.
Yeah, that is true for the biggest corporations
 
  • Like
Reactions: igor_kavinski

Red Squirrel

No Lifer
May 24, 2003
70,151
13,565
126
www.anyf.ca
Google decided to block my server's IP at some point, so Gmail won't work. My site does send out a lot of email, but it's not spam, it's just registration, password resets, etc. A lot of it is bot generated: they sign up and complete the first registration step, then get stuck at the captcha that's required for the final stage. This does generate a lot of email traffic though, so I may need to modify it so the captcha is on the main registration page. I may also look at offloading SMTP to a 3rd party. SMTP is a pain.

Code:
to=<[user]@gmail.com>, relay=gmail-smtp-in.l.google.com[142.250.31.26]:25, delay=0.85, delays=0.14/0.01/0.24/0.46, dsn=5.7.1, status=bounced (host gmail-smtp-in.l.google.com[142.250.31.26] said: 550-5.7.1 [144.217.157.4      18] Gmail has detected that this message is likely 550-5.7.1 suspicious due to the very low reputation of the sending IP address. 550-5.7.1 To best protect our users from spam, the message has been blocked. 550-5.7.1 For more information, go to 550 5.7.1  https://support.google.com/mail/answer/188131 af79cd13be357-7e2f30b81f7si23159785a.170 - gsmtp (in reply to end of DATA command))
 
Jul 27, 2020
25,998
17,945
146
You may have mistyped your email since the messages bounced back. You should be able to log in now.
Not going to bother, and thanks for raising the possibility that I have such fat pudgy fingers that I can't even type my own email properly. It bounced back for the same reason it did for Red's forum. Non-compliance with Google's updated rules of engagement.
 
Nov 17, 2019
13,229
7,851
136
The message literally says:



Remote host said: 550-5.1.1 The email account that you tried to reach does not exist. Please try double-checking the recipient's email address for typos or unnecessary spaces.
 
Jul 27, 2020
25,998
17,945
146
Remote host said: 550-5.1.1 The email account that you tried to reach does not exist. Please try double-checking the recipient's email address for typos or unnecessary spaces.
If you are able to get that much detail from the admin, or if you are the admin yourself, PM me the email please so I can admit that YES, it was my mistake, and then I will try again. Thanks.
 

Red Squirrel

No Lifer
May 24, 2003
70,151
13,565
126
www.anyf.ca
@igor_kavinski @GodisanAtheist I sent a test email from my personal email just to test the Gmail fiasco. My personal email server is on a different IP than the server being used by the forum, so I was curious whether it would go through. According to the logs it did. Also your account is active; I did it manually for now. Still need to sort out the forum sending mail to Gmail.

I may look at offloading that to a 3rd party if worst comes to worst. Google doesn't really like small guys hosting their own mail, it seems. Good way to convince people to use their cloud services lol.