Bram's Dev Blog


100 Days of Code Day 74 - Scholar Bot Detection

06 Nov 2018

Google thinks analyzer.atmire.com is a robot

A first glance at the additional logging I implemented yesterday revealed that Google Scholar is currently showing a captcha to my server so it can prove it is not a robot.

The code is exactly the same as what I’m running locally, so I would assume that it’s not the code itself that sets off these triggers, but maybe rather the volume of requests, the frequency or perhaps the IP it’s coming from.

Tunneling to click the captcha

The first thing I’m trying is to route a few Google Scholar page requests from the browser on my own machine through the server, so I can click the captcha myself and, perhaps as a one-off, get past it.

The article “How to Use SSH Tunneling to Access Restricted Servers and Browse Securely” led me to the instructions for dynamic port forwarding, which let me send all my browser traffic through my server.
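As a minimal sketch of what dynamic port forwarding looks like in practice, the snippet below only builds the ssh command line; it does not open the tunnel. The local port 1080 and the user name "bram" are assumptions for illustration, not values from my actual setup. The browser then needs to be pointed at localhost on that port as a SOCKS proxy.

```python
import shlex


def socks_tunnel_command(host, local_port=1080, user=None):
    """Build the ssh argument list for a local SOCKS proxy tunnelled via host."""
    target = f"{user}@{host}" if user else host
    # -D <port>: dynamic (SOCKS) port forwarding on the local machine
    # -N: do not run a remote command, only keep the tunnel open
    return ["ssh", "-D", str(local_port), "-N", target]


# Hypothetical user name; the port is an arbitrary choice.
cmd = socks_tunnel_command("analyzer.atmire.com", local_port=1080, user="bram")
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would start the tunnel and block until it is closed.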

Clicking around Google Scholar a few times did not bring up the captcha, so I wasn’t able to click it. Maybe that’s because my browser runs JavaScript while the requests from the server’s code do not.

scholar.py experiences

People using the scholar.py script have been running into the same issue. How to deal with “Please show you’re not a robot” links out to three other issues where people have offered suggestions.

Of those suggestions, I’m already collecting and setting cookies by hitting the homepage first, before doing any queries. I also already set a user agent.
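The cookie and user agent approach can be sketched with the Python standard library roughly as follows. This is not the code running on the server; the function names and the user agent string are hypothetical, and warm_up makes a real network request, so it is defined here but not called.

```python
import http.cookiejar
import urllib.request


def build_session(user_agent):
    """Return an opener that stores cookies and sends a fixed User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace urllib's default "Python-urllib/x.y" header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def warm_up(opener, homepage="https://scholar.google.be/"):
    """Hit the homepage first so any cookies it sets land in the jar."""
    with opener.open(homepage, timeout=10) as response:
        return response.status


# Hypothetical user agent string, for illustration only.
opener, jar = build_session("Mozilla/5.0 (X11; Linux x86_64)")
```

Any later query sent through the same opener automatically carries the cookies collected during the warm-up request.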

Changing the target url

My queries used to go out to http://scholar.google.com. When I run a few curl commands from the server, I get a redirect to https://scholar.google.com and another one from there to https://scholar.google.be. So just to rule out these redirects as the issue, I’ve updated the code to send all calls straight to https://scholar.google.be.
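One way to skip both redirects is to rewrite each query URL before sending it. The helper below is a sketch under the assumption that only the scheme and host need to change; the function name is hypothetical, and the path and query string are passed through untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def to_final_host(url, final_host="scholar.google.be"):
    """Rewrite a Scholar URL to https on the host the redirects end up at."""
    parts = urlsplit(url)
    # Leave URLs for unrelated hosts alone.
    if parts.hostname not in ("scholar.google.com", final_host):
        return url
    return urlunsplit(("https", final_host, parts.path, parts.query, parts.fragment))


print(to_final_host("http://scholar.google.com/scholar?q=dspace"))
```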

No luck so far

Whenever I try to look up an item URL, the server still gets “Please show that you’re not a robot”. If the current restrictions are bound to the IP, maybe I could send the requests out via other proxies, so that not all of them arrive at Scholar from the same IP.
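A rough sketch of what spreading requests over several proxies could look like, using only the standard library. The proxy addresses are placeholders, the function names are hypothetical, and matching on the text "not a robot" is an assumption based on the message quoted above rather than a documented marker.

```python
import itertools
import urllib.request


def proxy_openers(proxy_urls):
    """Yield (proxy, opener) pairs forever, cycling round-robin over the proxies."""
    for proxy in itertools.cycle(proxy_urls):
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        yield proxy, urllib.request.build_opener(handler)


def looks_blocked(html):
    """Guess whether a response body is the robot check page (assumed marker)."""
    return "not a robot" in html.lower()


# Placeholder proxy addresses, for illustration only.
rotation = proxy_openers(["http://10.0.0.1:3128", "http://10.0.0.2:3128"])
proxy, opener = next(rotation)
```

Each lookup would take the next opener from the rotation, and a response flagged by looks_blocked could be retried through the following proxy.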

Otherwise, I may just have to accept that people will need to do their item verifications manually, instead of the tool doing those for them.

Day 74 Plan

Do a test to see if issuing the requests from other IPs could give different results.

After that, ask my colleagues for assistance to access the assets and hopefully get a successful test run in place for the language switch.

Future Days - DSpace 7 Angular

Future Days - Analyzer.atmire.com work

Future Days - Productivity

Future Days - Jekyll http://bram-atmire.github.io/ site

Future Days - Atmire.com work

Investigate and work on search engine optimization (SEO) for the main atmire.com website.

Future Days - Learning just for learning

Sustainability challenge - Finish before Christmas

If I continue like October, I could hit day 68 by end of October and day 98 by end of November.