Scraping JS-heavy web pages that require a login

Red Squirrel

No Lifer
I'm looking at scraping values off a Cordex rectifier system's web page. I have maybe a couple hundred of them to do, and I want to put the data into a nice live table so that when there are power outages we can monitor the battery voltages of all our buildings. We have a tool for that now, but we will be losing it soon; they are moving the alarms to another system that is inferior and does not show voltages. These pages require a login, use HTTPS (making it hard to analyze them with a packet sniffer), and are also very heavy on JS, so looking at just the HTML you don't get any values.

I assume there must be some Python libraries that could help me do that, and maybe some Firefox extensions that can let me analyze it more. Using the "Network" tool in Firefox I've found some basic stuff, like JSON files I can get data from, except when I try to fetch them manually they don't have the context of a login. So I presume my script would need to perform a login first, keep the cookie info, etc. Basically there's a lot of digging around I'll need to do to get everything right so the session data is kept.
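I'm picturing something along these lines: use a requests session to replay the login POST seen in the Network tab, then hit the JSON endpoints with the cookies the session holds. (A minimal sketch; the URLs and field names here are made up and would need to match whatever the Network tab actually shows.)

Python:
# A rough sketch using the requests library. The login URL, form field
# names, and JSON endpoint are placeholders; they'd have to match what
# the "Network" tab shows for the actual Cordex page.
import requests

session = requests.Session()  # a Session keeps cookies between requests

# Replicate the login POST captured in the Network tab (hypothetical fields)
session.post('https://rectifier.example.local/login',
             data={'username': 'admin', 'password': 'secret'},
             verify=False)  # only if the unit uses a self-signed cert

# With the session cookie in place, the JSON endpoint should now respond
data = session.get('https://rectifier.example.local/api/values.json',
                   verify=False).json()
print(data)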

I'm looking mostly for guidance on how this sort of thing would normally be done. Am I more or less going down the right path by looking at the "Network" feature of Firefox and analyzing the GET/POST requests and headers to see how the data is fetched, or is there perhaps an easier way? I would most likely write the scraper in Python and have the values go into a clean, easy-to-read file, then use monitoring software to simply watch the file's data points. Those details are minor and I'm not too worried about that part. It's mostly getting the data.
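For the file part, something as simple as a CSV would work. A minimal sketch (the site names and voltages are made-up examples):

Python:
import csv

# Hypothetical scraped values: site name -> battery voltage
readings = {'Building A': 54.2, 'Building B': 53.8}

# Write a flat CSV the monitoring software can poll
with open('battery_voltages.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['site', 'battery_voltage'])
    for site, voltage in readings.items():
        writer.writerow([site, voltage])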
 

Red Squirrel

No Lifer
I can do that too...

Most of those just cover basic HTML, like if you want to look at the actual HTML data and pick up tags, etc.

I'll just keep doing what I was doing then. Just thought I'd ask if there are maybe tools I should be aware of. I also don't use Chrome; I use Firefox.
 

Ken g6

Programming Moderator, Elite Member
Red Squirrel said: "Am I more or less going down the right path looking at the 'network' feature of Firefox and just analyzing the GET/POST requests and headers etc to see how the data is fetched, or is there perhaps an easier way?"
That's probably the "right" way to go about it. But if you can't easily find the data you're looking for that way, you could try a "headless browser": a browser that goes to a URL and parses everything, including the JS, but doesn't display anything to a screen.
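For example, a minimal sketch with Selenium driving headless Firefox (assumes geckodriver is installed; the URL is hypothetical):

Python:
from selenium import webdriver

# Run Firefox without opening a window
opts = webdriver.FirefoxOptions()
opts.add_argument('-headless')
driver = webdriver.Firefox(options=opts)

driver.get('https://rectifier.example.local/status')  # hypothetical URL
print(driver.page_source)  # the HTML after the JS has run
driver.quit()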
 

mxnerd

Diamond Member
Probably Selenium? There's a WebDriver for Firefox as well.

Actually you can use many different languages, and there are definitely many tutorials on YouTube.
 

purbeast0

No Lifer
Could you use something like Poster for this? I don't remember if that is a plugin you have to download for Firefox or if it's baked in. Chrome has something similar called Postman.

I think you can give login credentials to the requests too, but I am not 100% positive about how the authentication/authorization works for various systems. I have used PKI certs to "log in" with Poster before, and then I could make all the requests as if I was logged into the actual website.
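If the system takes client certs, the same thing should be possible from Python. A hedged sketch (the URL and file paths are placeholders):

Python:
import requests

# Present a PKI client certificate, much like Poster does
resp = requests.get('https://internal.example.local/api/data',
                    cert=('client.crt', 'client.key'))
print(resp.status_code, resp.text)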
 

Red Squirrel

No Lifer
I'll probably be starting on a project in the new year; from some quick reading it looks like Selenium may do what I need. It might actually be easier than I thought. I was looking at one of the web applications we use, and there's so much Ajax and JS going on that it's going to be rather complex to scrape, but Selenium might handle a lot of that if it's simply looking at the elements of the web page and interacting with them. I still need to look deeper into it, though. It looks like it basically controls the browser, but I will probably want this running on a back-end server, so hopefully it can still do that as a background thing using the browser installed there (probably Firefox). Basically we have a ticket system that is really tedious to use, where we essentially need to create the ticket three times because of the way it works, so I want to make a web form that you submit, and then the Python script would pick it up and do all the dirty work.

So yeah, I'll look into Selenium and go from there.
 

mxnerd

Diamond Member
Well, AnandTech's website itself is very complex.

I did some testing a while back. The hard part is finding the correct object in the web page, and timing. This runs in the front end, which could be very different from back-end processing.

Sometimes it works and sometimes it doesn't; it seems like there are a lot of timing issues (page load time can be very long sometimes) and even object/object-id changes. I did not investigate further at first.

OK, I found that the username and password object numbers do change; they probably increase over time with the forum's login sessions, in order to make things harder for attackers. Even the login button changes! (It flips between js-XFUniqueId125 and js-XFUniqueId126.) No wonder the script couldn't find the correct objects.

To get the correct object, right-click on the field and choose Inspect.

If it's an internal web page, you probably won't encounter as many issues as with pages that have loads of ads and changing login object ids.

Python:
from selenium import webdriver

# webdriver_manager installs the driver automatically, finding the latest build,
# so you don't have to fetch the correct exe from the internet and save it to a particular folder manually under Windows.
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep
from selenium.webdriver.common.action_chains import ActionChains

# choose Chrome or Firefox, arguments are quite different between the two.
driver_choice = 2

#Chrome
if driver_choice == 1:
    opts = webdriver.ChromeOptions()
    opts.add_argument('--ignore-certificate-errors')
    opts.add_argument('--disable-notifications')
    opts.add_argument('--start-maximized')
    #opts.add_argument('--incognito')
    #opts.add_argument('--headless')
    driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
    #new method to install driver, will find latest driver build automatically

#Firefox
if driver_choice == 2:
#    from selenium.webdriver.firefox.options import Options
    opts = webdriver.FirefoxOptions()
    #opts.add_argument('-private')
    #opts.add_argument('-headless')
    driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

#No extensions will be loaded when browser starts

print('current browser is: ', driver.name)

driver.implicitly_wait(20)
driver.get("http://forums.anandtech.com")
#print (driver.page_source)

# handling accept cookie popup
# https://stackoverflow.com/questions/64032271/handling-accept-cookies-popup-with-selenium-in-python

accept_cookie_btn = '//*[@id="XF"]/body/div[3]/ul/li/div/div[2]/div[2]/a[1]/span'
dismiss_P_N_notice_btn = '//*[@id="js-XFUniqueId3"]'
community_question_btn = '//*[@id="js-XFUniqueId4"]'
bell_alert_circle = '//*[@id="onesignal-bell-launcher"]/div[1]/svg/circle'
login_btn = '//*[@id="top"]/div[1]/div[2]/nav/div/div[4]/div[1]/a[1]/span'

# driver.find_element_by_xpath(accept_cookie_btn).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,accept_cookie_btn))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,dismiss_P_N_notice_btn))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,community_question_btn))).click()

driver.implicitly_wait(10)

actions = ActionChains(driver)

#anandtech login form
driver.find_element_by_xpath(login_btn).click()
#login actions - filling username and password
sleep(2)

username = "your_username"
password = "your_password"

#right click on the fields and inspect to find correct object id
frm_username = '//*[@id="_xfUid-1-1605625409"]'
frm_password = '//*[@id="_xfUid-2-1605625409"]'
frm_login_btn = '//*[@id="js-XFUniqueId126"]/div/form/div[1]/dl/dd/div/div[2]/button'
# send_keys must be called on an element, not on the driver
driver.find_element_by_xpath(frm_username).send_keys(username)
driver.find_element_by_xpath(frm_password).send_keys(password)

sleep(2)

#click login button on login popup form
driver.find_element_by_xpath(frm_login_btn).click()
#driver.close()

I used PyScripter for debugging.

Well, not only does that login button id flip between 125 and 126, it sometimes even comes out as 127.
The only way Selenium could work consistently was to press the Tab key after filling in the username and password, tabbing through the other objects on the form until focus lands on the login button.

Also note that the driver.send_keys method no longer works; you have to call send_keys on an element, as in the corrected lines above.

Python:
elem1 = driver.find_element_by_name('login')
elem1.send_keys(username)
elem2 = driver.find_element_by_name('password')
elem2.send_keys(password)
# tab through input fields and other web page objects
elem2.send_keys(Keys.TAB)
sleep(1)

elem3 = driver.switch_to.active_element
elem3.send_keys(Keys.TAB)
sleep(1)

elem4 = driver.switch_to.active_element
elem4.send_keys(Keys.TAB)
sleep(1)

elem5 = driver.switch_to.active_element
elem5.send_keys(Keys.TAB)
sleep(1)

elem6 = driver.switch_to.active_element
elem6.click()
 

Cogman

Lifer
For JS, I'd probably use an end-to-end (e2e) testing framework to do the scraping. One that my company is using for testing is https://www.cypress.io/ which is pretty dang slick. I imagine it might fit your problem pretty well.