
Downloading a web page and parsing the HTML with Visual Studio Express...

Ichinisan

Lifer
There's a web application our company uses to manage cable modems and equipment for our customers. One of its pages returns a poorly-formatted list of statistics for all customers in a specific region. I want to make a program that downloads this page, extracts the data from the HTML table, and then lets me work with the data and re-sort it. I've been doing this manually for a while, but it's VERY tedious. It's also frustrating because Notepad++ keeps crashing (the syntax highlighting gets confused when working with large amounts of data).

Anyway, this page requires a login, but doesn't use HTTPS (as far as I can tell). I don't know how to make my program simulate logging in the way a web browser does. I found some code online that performs an HTTP GET request, but the server always returns an error (probably because I need to simulate logging in somehow).

If any of you have done something like this with a VS2010 project before, please let me know if you have some pointers.
 
Depends on the type of authentication used by the page. For most sites, what you need to do is find the URL to which the username/password are posted when a user logs in. When you post the right tokens to that URL, the server sends back an authentication cookie (possibly along with a redirect to another page), which you save and present to the server with all future requests. Not too complicated.
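
Here's roughly what that looks like in .NET, since you're on VS2010. This is only a sketch of the simple case; the login URL and the "user"/"pass" field names are made up, so substitute whatever your site's login form actually posts.

Code:
// Sketch of the simple case: POST credentials, let the CookieContainer catch the
// auth cookie, then reuse it for the page you actually want. The URL and field
// names ("user", "pass") are placeholders; check the real login form for the real ones.
using System;
using System.IO;
using System.Net;
using System.Text;

class LoginSketch
{
    static void Main()
    {
        var cookies = new CookieContainer();

        // POST the credentials to the login URL.
        var login = (HttpWebRequest)WebRequest.Create("http://example.com/login.do");
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;                 // server's Set-Cookie lands in here
        byte[] body = Encoding.UTF8.GetBytes("user=me&pass=secret");
        login.ContentLength = body.Length;
        using (Stream s = login.GetRequestStream()) { s.Write(body, 0, body.Length); }
        using (login.GetResponse()) { }                  // only care about the cookie here

        // Same CookieContainer on the next request, so the cookie goes back automatically.
        var page = (HttpWebRequest)WebRequest.Create("http://example.com/stats?region=1");
        page.CookieContainer = cookies;
        using (WebResponse resp = page.GetResponse())
        using (var reader = new StreamReader(resp.GetResponseStream()))
            Console.WriteLine(reader.ReadToEnd());       // the HTML table you want to parse
    }
}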
 
Here are the basic steps using WinInet; I haven't done this in .NET yet, but the flow is the same (there's a rough .NET sketch after the steps). If you go to CodeProject.com you can probably dig up some sample code; I can't share any from work.

Create a session object

GET the login page, scrape (parse) any tokens needed out of the HTML for the login form (there are often hidden fields with some kind of one-time token or session ID)

POST the login form. The body might be URL-encoded or multipart/form-data; look at the form tag's enctype attribute (and its method attribute) to see which

For simple authentication, the server response will include session cookie(s) in the header of the response, which will be processed by your session object and used automatically in future GET / POST requests.
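
To make the token-scraping step concrete, here's a quick-and-dirty way to pull the hidden fields out of the login page in .NET. The URL is a placeholder and the regex approach is fragile; the HtmlAgilityPack library is a sturdier way to walk the form. In a real run you'd do this GET with the same cookie-aware request object you use for the login POST, so any pre-login session cookie carries over.

Code:
// Sketch of the "scrape hidden tokens" step: GET the login page and collect every
// hidden <input>'s name/value pair so they can be appended to the POST body
// alongside the username/password. URL is a placeholder; the regex is fragile,
// HtmlAgilityPack is the sturdier option.
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class TokenScrape
{
    static void Main()
    {
        string loginHtml = new WebClient().DownloadString("http://example.com/login");

        var hidden = new Dictionary<string, string>();
        var re = new Regex(
            "<input[^>]*type=[\"']hidden[\"'][^>]*name=[\"']([^\"']+)[\"'][^>]*value=[\"']([^\"']*)[\"']",
            RegexOptions.IgnoreCase);
        foreach (Match m in re.Matches(loginHtml))
            hidden[m.Groups[1].Value] = m.Groups[2].Value;

        // Each pair gets URL-encoded into the POST body, e.g.
        // "user=me&pass=secret&" + name + "=" + Uri.EscapeDataString(value)
        foreach (KeyValuePair<string, string> kv in hidden)
            Console.WriteLine(kv.Key + " = " + kv.Value);
    }
}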

There are more complicated authentication schemes that require following any redirects and doing follow-up POSTs or GETs, and sometimes the login process might use script code to set cookies (which you then need to emulate manually).


If you get stuck, you can use something like Wireshark to compare the traffic between the browser and the site with the traffic between your app and the site, to see what you're doing differently. There will usually be a bunch of extra GET operations for things like images and .js files that you can ignore, but the POST step should include the same fields in the same order.
 
So, you have a single page with all the data you need, in a single table no less, yet you want to download it all and parse it with VS.NET? IMHO, you're going at it from the wrong angle. I would leave the data on the page and use a user script to parse it. Chrome supports user scripts natively; Firefox requires the Greasemonkey add-on. If you can get that far, staying on the same page seems a lot simpler.

If you're looking for sorting code, there are a number of user scripts that sort other tables already; but I'd start with MediaWiki's sorting code.
 
You don't want to extract your data right out of the HTML. If you do, your program will be dependent on the HTML layout, which is a really fragile way to do it. A small change to their layout could break your extraction.

I would ask them to put up a small API (that returns JSON or something). It shouldn't take them more than 10 minutes to put together. It's the same data aggregation as generating the HTML, just a different presentation.
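
For what it's worth, the consuming side is only a few lines once you're getting JSON instead of HTML. This is just a sketch: the URL and the shape of the response are invented, and the ModemStat class would have to mirror whatever the hypothetical endpoint actually returns. JavaScriptSerializer lives in the System.Web.Extensions assembly, so add a reference to it in the VS2010 project.

Code:
// Sketch of consuming a hypothetical JSON endpoint instead of scraping HTML.
// The URL and the ModemStat shape are invented for illustration.
using System;
using System.Collections.Generic;
using System.Net;
using System.Web.Script.Serialization;   // System.Web.Extensions assembly

class ModemStat
{
    public string Mac { get; set; }   // assumed to mirror the JSON keys
    public double Snr { get; set; }
}

class ApiClientSketch
{
    static void Main()
    {
        string json = new WebClient().DownloadString("http://example.com/api/stats?region=1");
        List<ModemStat> stats = new JavaScriptSerializer().Deserialize<List<ModemStat>>(json);

        // With plain objects, re-sorting is one line instead of an HTML-parsing exercise.
        stats.Sort((a, b) => a.Snr.CompareTo(b.Snr));
        foreach (ModemStat s in stats)
            Console.WriteLine(s.Mac + "\t" + s.Snr);
    }
}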

I actually just did something like this at work for another team. It literally took me 5 minutes.
 