
Web scraping C#

MrScott81

Golden Member
I'm trying to scrape some data from this website, but when I read the page in code (or right-click and view source in Chrome) I get something different from the final source:

http://tempostorm.com/decks/kitkatzs-control-warrior

Once the page has completely loaded, you can inspect it with the Chrome developer tools and see the "final" version.

Any idea how to get this "final" version in C#?

This is a snippet from what I'm using:
Code:
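// Requires: using System; using System.Net;
// Returns the HTML as served, before any client-side JavaScript has run.
var wc = new WebClient();
var websiteContent = await wc.DownloadStringTaskAsync(new Uri(url));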
 
You need to run a headless browser like PhantomJS to render the page in memory. You're getting the initial version, before the JavaScript runs client-side and modifies the DOM. I haven't used PhantomJS with C# on the .NET stack, but I know people are doing it.

Edit: I should add to my simplistic answer. Maybe you have to render the page, maybe you don't. It all depends on the page in question. There are all sorts of little tricks to this, and you pretty much have to dive into the page to figure out the best way to get what you want.

The bottom line is that after the page is loaded in a browser, client-side JavaScript is going to run and modify it. Sometimes that means the content you want is fetched from a service during or after page load. Sometimes the content you want is already in the page source, but in a different place, and the script moves it around when the page is rendered. It all depends on how they structured the site and when they load content.
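One way to check which case applies is to look at the raw response before reaching for a headless browser. The sketch below is only an illustration: InitialHtmlProbe and its methods are hypothetical names, not something from this thread, and a regex is a rough tool for HTML that is good enough for a quick probe but not for real parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for inspecting the HTML as served (no JavaScript executed).
public static class InitialHtmlProbe
{
    // Matches <script ...>body</script>; Singleline lets '.' span newlines.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(.*?)</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the non-empty inline script bodies found in the raw HTML.
    // Data embedded by the server (for example a JSON blob) often lives here.
    public static List<string> InlineScripts(string html)
    {
        var bodies = new List<string>();
        foreach (Match m in ScriptBlock.Matches(html))
        {
            string body = m.Groups[1].Value.Trim();
            if (body.Length > 0)
            {
                bodies.Add(body);
            }
        }
        return bodies;
    }

    // True when the marker text appears anywhere in the raw response.
    public static bool Contains(string html, string marker)
    {
        return html.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Pass in the string returned by DownloadStringTaskAsync along with a piece of text you can see in the rendered page. If Contains returns false and none of the inline scripts carry the data, the content is probably fetched after load, and rendering the page (or calling the underlying service directly) is likely the only option.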
 
WebBrowser doesn't get the proper HTML as far as I can tell. I'll have to keep digging.

So the WebBrowser control doesn't run JavaScript? Either way, you're going to have to find a way to render the page, or mine the info you want from what is in the initial HTML response. If it isn't there, then rendering is the only choice. Google around for C# and headless browsers. I see a few discussions, so there must be some solution you can put together.
 
The WebBrowser control should work. Make sure to wait for the DocumentCompleted event before you grab the HTML. I found it works best to use InvokeScript and call eval with document.body.outerHTML as its argument.

e.g.
browser.Document.InvokeScript("eval", new object[] { "document.body.outerHTML" });


Also maybe useful--I'm a fan of CsQuery for traversing the html.
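Putting the WebBrowser suggestion together, a minimal sketch might look like the following. It assumes Windows Forms (System.Windows.Forms) on Windows. RenderedHtml and Fetch are hypothetical names used only for illustration. DocumentCompleted can fire before late AJAX calls finish, so a real scraper may also need to wait or poll for the element it wants.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical wrapper around the WinForms WebBrowser control.
public static class RenderedHtml
{
    public static string Fetch(string url)
    {
        string html = null;

        // WebBrowser is an ActiveX wrapper, so it needs an STA thread
        // with a running message loop.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event also fires for frames; only react to the main document.
                    if (e.Url.AbsolutePath != browser.Url.AbsolutePath)
                    {
                        return;
                    }

                    // Ask the page's own script engine for the current DOM.
                    html = browser.Document.InvokeScript(
                        "eval", new object[] { "document.body.outerHTML" }) as string;

                    Application.ExitThread(); // stop the message loop started below
                };

                browser.Navigate(url);
                Application.Run(); // pump messages until ExitThread is called
            }
        });

        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return html;
    }
}
```

Calling RenderedHtml.Fetch(url) blocks until the main document has completed and returns the body markup as the script engine sees it at that moment, or null if the eval call returned nothing.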
 