Web scraping C#

MrScott81 · Dec 21, 2014

I'm trying to scrape some data from this website, but when I read the page (or right-click and view source in Chrome), I get something different from the final source:

http://tempostorm.com/decks/kitkatzs-control-warrior

Once the page has completely loaded, you can inspect it with Chrome's developer tools and see the "final" version.

Any idea how to get this "final" version in C#?

This is a snippet from what I'm using:

Code:

var wc = new WebClient();
var websiteContent = await wc.DownloadStringTaskAsync(new Uri(url));

Markbnj · Dec 21, 2014

You need to run a headless browser like PhantomJS to render the page in memory. You're getting the initial version before the JavaScript runs client side and modifies the DOM. I haven't used PhantomJS with C# on the .NET stack, but I know people are doing it.

Edit: I should add to my simplistic answer. Maybe you have to render the page, maybe you don't. It all depends on the page in question. There are all sorts of little tricks to this, and you pretty much have to dive into the page to figure out the best way to get what you want.

The bottom line is that after the page is loaded in a browser, client-side JavaScript is going to run and modify it. Sometimes that means the data you want gets fetched from a service during or after page load. Sometimes the data you want is already in the page source, but it's somewhere else and the script moves it around when the page is rendered. It all depends on how they structured the site and when they load content.
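
For the second case, where the data is already sitting in the page source, a rough sketch of pulling it out without rendering anything might look like the following. It assumes the site assigns a JSON object to a JavaScript variable in an inline script. The class name, method name, and the variable name in the usage line are made up for illustration and are not taken from the page in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InitialHtmlMiner
{
    // Returns the raw JSON text assigned to a JavaScript variable inside the
    // page source, or null when no such assignment is found.
    // "variableName" is whatever the site happens to use (an assumption here).
    public static string ExtractAssignedJson(string html, string variableName)
    {
        // Matches: <name> = { ... };  using a lazy match that stops at the
        // first closing brace followed by a semicolon.
        var pattern = Regex.Escape(variableName) + @"\s*=\s*(\{.*?\})\s*;";
        var match = Regex.Match(html, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Usage would be something like ExtractAssignedJson(websiteContent, "window.deckData"), where "window.deckData" is a hypothetical name. The lazy match stops early if a string value inside the JSON contains "};", so treat this as a starting point and hand the result to a real JSON parser.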

KB · Dec 22, 2014

You could try loading the page in the WebBrowser control in .NET and then walking the DOM through the control's Document property (an HtmlDocument).

http://forum.codecall.net/topic/73737-webbrowser-web-scrapping/

MrScott81 · Dec 23, 2014

WebBrowser doesn't get the proper html as far as I can tell, I'll have to keep digging.

Markbnj · Dec 23, 2014

MrScott81 said:
WebBrowser doesn't get the proper html as far as I can tell, I'll have to keep digging.

So the WebBrowser control doesn't render javascript? Anyway you're going to have to find a way to do that or mine the info you want from what is in the initial html response. If it isn't there then rendering is the only choice. Google around for C# and headless browsers. I see a few discussions. Must be some solution you can put together.

nickbits · Dec 23, 2014

The WebBrowser control should work. Make sure to wait for the DocumentCompleted event before you grab the HTML. I found it works best to use InvokeScript and call eval with document.body.outerHTML as its argument.

e.g.
browser.Document.InvokeScript("eval", new object[] { "document.body.outerHTML" });

Also maybe useful--I'm a fan of CsQuery for traversing the html.
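
To put the WebBrowser advice together (wait for the document to finish, then eval document.body.outerHTML), a minimal sketch might look like this. It assumes a Windows Forms project, and the class and method names are made up for illustration. The control needs an STA thread with a message loop, so the sketch runs one on a worker thread.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

public static class RenderedHtmlFetcher
{
    // Loads the URL in a WebBrowser control and returns document.body.outerHTML
    // as seen once the top-level document has completed. No timeout is applied.
    public static string GetRenderedBody(string url)
    {
        string html = null;
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // The event also fires for frames; wait for the main document.
                    if (e.Url != browser.Url) return;
                    html = browser.Document.InvokeScript(
                        "eval", new object[] { "document.body.outerHTML" }) as string;
                    Application.ExitThread(); // ends the message loop below
                };
                browser.Navigate(url);
                Application.Run(); // pump messages until ExitThread is called
            }
        });
        thread.SetApartmentState(ApartmentState.STA); // required by the control
        thread.Start();
        thread.Join();
        return html;
    }
}
```

Two caveats. If the page fetches its data after load, the completed event can fire before that data arrives, so you may need to poll the DOM for the element you want instead of reading it right away. Also, the control defaults to an old Internet Explorer compatibility mode, so script-heavy pages may not render the way they do in Chrome.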
