Scraping JS-heavy web pages that require a login

Red Squirrel

No Lifer
I'm looking at scraping values off a Cordex rectifier system's web page. I have maybe a couple hundred of them to do, and I want to put the data into a nice live table so that when there are power outages we can monitor the battery voltages of all our buildings. We have a tool for that now, but we will be losing it soon; they are moving the alarms to another system that is inferior and does not show voltages. These pages require a login, use HTTPS (making it hard to analyze them with a packet sniffer), and are also very heavy on JS, so looking at just the HTML you don't get any values.

I assume there must be some Python libraries that could help me do that, and maybe some Firefox extensions that can let me analyze it more. Using the "Network" tool in Firefox I've found some basic stuff, like JSON files I can get data from, except when I try to fetch them manually they don't have the context of a login. So I presume my script would need to perform a login first, keep the cookie info, etc. Basically there's a lot of digging around I'll need to do to get everything right so the session data is kept.
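I'm picturing something along these lines: use a requests session to replay the login POST seen in the Network tab, then hit the JSON endpoints with the cookies the session holds. (A minimal sketch; the URLs and field names here are made up and would need to match whatever the Network tab actually shows.)

Python:
# A rough sketch using the requests library. The login URL, form field
# names, and JSON endpoint are placeholders; they'd have to match what
# the "Network" tab shows for the actual Cordex page.
import requests

session = requests.Session()  # a Session keeps cookies between requests

# Replicate the login POST captured in the Network tab (hypothetical fields)
session.post('https://rectifier.example.local/login',
             data={'username': 'admin', 'password': 'secret'},
             verify=False)  # only if the unit uses a self-signed cert

# With the session cookie in place, the JSON endpoint should now respond
data = session.get('https://rectifier.example.local/api/values.json',
                   verify=False).json()
print(data)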

I'm looking mostly for guidance on how this sort of thing would normally be done. Am I more or less going down the right path by looking at the "Network" feature of Firefox and analyzing the GET/POST requests and headers to see how the data is fetched, or is there perhaps an easier way? I would most likely write the scraper in Python and have the values go into a clean, easy-to-read file, then use monitoring software to simply watch the file's data points. Those details are minor and I'm not too worried about that part. It's mostly getting the data.
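For the file part, something as simple as a CSV would work. A minimal sketch (the site names and voltages are made-up examples):

Python:
import csv

# Hypothetical scraped values: site name -> battery voltage
readings = {'Building A': 54.2, 'Building B': 53.8}

# Write a flat CSV the monitoring software can poll
with open('battery_voltages.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['site', 'battery_voltage'])
    for site, voltage in readings.items():
        writer.writerow([site, voltage])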
 

Red Squirrel

No Lifer
I can do that too...

Most of those just cover basic HTML, like if you want to look at the actual HTML data and pick up tags, etc.

I'll just keep doing what I was doing then. Just thought I'd ask if there are maybe tools I should be aware of. I also don't use Chrome; I use Firefox.
 

Ken g6

Programming Moderator, Elite Member
Red Squirrel said: "Am I more or less going down the right path looking at the 'network' feature of Firefox and just analyzing the GET/POST requests and headers etc to see how the data is fetched, or is there perhaps an easier way?"
That's probably the "right" way to go about it. But if you can't easily find the data you're looking for that way, you could try a "headless browser": a browser that goes to a URL and parses everything, including the JS, but doesn't display anything to a screen.
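For example, a minimal sketch with Selenium driving headless Firefox (assumes geckodriver is installed; the URL is hypothetical):

Python:
from selenium import webdriver

# Run Firefox without opening a window
opts = webdriver.FirefoxOptions()
opts.add_argument('-headless')
driver = webdriver.Firefox(options=opts)

driver.get('https://rectifier.example.local/status')  # hypothetical URL
print(driver.page_source)  # the HTML after the JS has run
driver.quit()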
 

mxnerd

Diamond Member
Probably Selenium? There's a WebDriver for Firefox as well.

Actually you can use many different languages, and there are definitely many tutorials on YouTube.
 

purbeast0

No Lifer
Could you use something like Poster for this? I don't remember if that is a plugin you have to download for Firefox or if it's baked in. Chrome has something similar called Postman.

I think you can give login credentials to the requests too, but I am not 100% positive about how the authentication/authorization works for various systems. I have used PKI certs to "log in" with Poster before, and then I could make all the requests as if I was logged into the actual website.
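If the system takes client certs, the same thing should be possible from Python. A hedged sketch (the URL and file paths are placeholders):

Python:
import requests

# Present a PKI client certificate, much like Poster does
resp = requests.get('https://internal.example.local/api/data',
                    cert=('client.crt', 'client.key'))
print(resp.status_code, resp.text)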
 

Red Squirrel

No Lifer
I'll probably be starting on a project in the new year; from some quick reading it looks like Selenium may do what I need. It might actually be easier than I thought. I was looking at one of the web applications we use, and there's so much Ajax and JS going on that it's going to be rather complex to scrape, but Selenium might handle a lot of that if it's simply looking at the elements of the web page and interacting with them. I still need to look deeper into it, though. It looks like it basically controls the browser, but I will probably want this running on a back-end server, so hopefully it can still do that as a background thing using the browser installed there (probably Firefox). Basically we have a ticket system that is really tedious to use, where we essentially need to create the ticket three times because of the way it works, so I want to make a web form that you submit, and then the Python script would pick it up and do all the dirty work.

So yeah, I'll look into Selenium and go from there.
 

mxnerd

Diamond Member
Well, AnandTech's website itself is very complex.

I did some testing a while back. The hard part is finding the correct object in the web page, and timing. This runs in the front end, which could be very different from back-end processing.

Sometimes it works and sometimes it doesn't; it seems like there are a lot of timing issues (page load time can be very long sometimes) and even object/object-id changes. I did not investigate further at first.

OK, I found that the username and password object numbers do change; they probably increase over time with the forum's login sessions, in order to make things harder for attackers. Even the login button changes! (It flips between js-XFUniqueId125 and js-XFUniqueId126.) No wonder the script couldn't find the correct objects.

To get the correct object, right-click on the field and choose Inspect.

If it's an internal web page, you probably won't encounter as many issues as with pages that have loads of ads and changing login object ids.

Python:
from selenium import webdriver

# webdriver_manager installs the driver automatically, finding the latest build,
# so you don't have to fetch the correct exe from the internet and save it to a particular folder manually under Windows.
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep
from selenium.webdriver.common.action_chains import ActionChains

# choose Chrome or Firefox, arguments are quite different between the two.
driver_choice = 2

#Chrome
if driver_choice == 1:
    opts = webdriver.ChromeOptions()
    opts.add_argument('--ignore-certificate-errors')
    opts.add_argument('--disable-notifications')
    opts.add_argument('--start-maximized')
    #opts.add_argument('--incognito')
    #opts.add_argument('--headless')
    driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
    #new method to install driver, will find latest driver build automatically

#Firefox
if driver_choice == 2:
#    from selenium.webdriver.firefox.options import Options
    opts = webdriver.FirefoxOptions()
    #opts.add_argument('-private')
    #opts.add_argument('-headless')
    driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

#No extensions will be loaded when browser starts

print('current browser is: ', driver.name)

driver.implicitly_wait(20)
driver.get("http://forums.anandtech.com")
#print (driver.page_source)

# handling accept cookie popup
# https://stackoverflow.com/questions/64032271/handling-accept-cookies-popup-with-selenium-in-python

accept_cookie_btn = '//*[@id="XF"]/body/div[3]/ul/li/div/div[2]/div[2]/a[1]/span'
dismiss_P_N_notice_btn = '//*[@id="js-XFUniqueId3"]'
community_question_btn = '//*[@id="js-XFUniqueId4"]'
bell_alert_circle = '//*[@id="onesignal-bell-launcher"]/div[1]/svg/circle'
login_btn = '//*[@id="top"]/div[1]/div[2]/nav/div/div[4]/div[1]/a[1]/span'

# driver.find_element_by_xpath(accept_cookie_btn).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,accept_cookie_btn))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,dismiss_P_N_notice_btn))).click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,community_question_btn))).click()

driver.implicitly_wait(10)

actions = ActionChains(driver)

#anandtech login form
driver.find_element_by_xpath(login_btn).click()
#login actions - filling username and password
sleep(2)

username = "your_username"
password = "your_password"

#right click on the fields and inspect to find correct object id
frm_username = '//*[@id="_xfUid-1-1605625409"]'
frm_password = '//*[@id="_xfUid-2-1605625409"]'
frm_login_btn = '//*[@id="js-XFUniqueId126"]/div/form/div[1]/dl/dd/div/div[2]/button'
# send_keys must be called on an element, not on the driver
driver.find_element_by_xpath(frm_username).send_keys(username)
driver.find_element_by_xpath(frm_password).send_keys(password)

sleep(2)

#click login button on login popup form
driver.find_element_by_xpath(frm_login_btn).click()
#driver.close()

I used PyScripter for debugging.

Well, not only does that login button id flip between 125 and 126, it sometimes even comes out as 127.
The only way Selenium could work consistently was to press the Tab key after filling in the username and password, tabbing through the other objects on the form until focus lands on the login button.

Also note that the driver.send_keys method no longer works; you have to call send_keys on an element, as in the corrected lines above.

Python:
elem1 = driver.find_element_by_name('login')
elem1.send_keys(username)
elem2 = driver.find_element_by_name('password')
elem2.send_keys(password)
# tab through input fields and other web page objects
elem2.send_keys(Keys.TAB)
sleep(1)

elem3 = driver.switch_to.active_element
elem3.send_keys(Keys.TAB)
sleep(1)

elem4 = driver.switch_to.active_element
elem4.send_keys(Keys.TAB)
sleep(1)

elem5 = driver.switch_to.active_element
elem5.send_keys(Keys.TAB)
sleep(1)

elem6 = driver.switch_to.active_element
elem6.click()
 

Cogman

Lifer
For JS, I'd probably use an end-to-end (e2e) testing framework to do the scraping. One that my company is using for testing is https://www.cypress.io/ which is pretty dang slick. I imagine it might fit your problem pretty well.