Why scraping sucks

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
It's almost impossible to describe to a client any of what makes data on a given page easier or harder to get. They can see the data, and everything beyond that is spoken in Klingon.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,453
4,289
75
Hm, this might be of use. Maybe ask them to put tags like that around anything they'd use for a Microsoft Word form letter.
 

Markbnj

Ken g6 said:
Hm, this might be of use. Maybe ask them to put tags like that around anything they'd use for a Microsoft Word form letter.

Unfortunately this particular gig is uninvited scraping, so we have to work with what we have. The data can be had, but obviously every site is different. Right now I'm dealing with one where I have to generate an ASP postback in order to navigate pages. You try describing this stuff to the client when the page looks just like the page you easily scraped yesterday, and their eyes just glaze over.
 

Train

Lifer
Jun 22, 2000
13,578
73
91
www.bing.com
A while ago I worked with a tool called "Center Stage" that was basically the Rolls Royce of page scrapers.

You'll never get hand-written scraping to work in the long run; it is just too brittle. Center Stage used some slick pattern/format matching and a GUI-based mapping of fields to your output (like columns in a DB table). Any place where it wasn't smart enough to figure out what you were looking for, you could plug in your own Java methods to do custom matching.

It was used heavily in the early days by companies like Travelocity to scrape plane/hotel info, before that data was all available via APIs. The only problem was that a typical license was about $250k (this was in 1999, I think).

Never heard of them after the boom; I heard they were getting bought out by some CMS company. And I haven't seen anything like it since. Maybe because page scraping just isn't as big as it used to be.

p.s. Sorry if this post was just me rambling and not of any help :/
 

Markbnj

Yeah, we're not hard-coding it. We're using Scrapy and XPath templates. It works really well, but every page throws you some curveball. Really this is just more of a vent about trying to communicate challenges at this level to a client.
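For anyone wondering what "XPath templates" can look like in practice, here is a minimal sketch, not the actual setup described above. It keeps a per-site dictionary mapping output fields to XPath expressions so the extraction code stays generic. It uses only Python's standard-library `xml.etree.ElementTree` (limited XPath subset, well-formed markup only) instead of Scrapy, and the site key, field names, and sample markup are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site template: output field -> XPath (ElementTree subset).
TEMPLATES = {
    "example-shop": {
        "title": ".//h1[@class='product-title']",
        "price": ".//span[@class='price']",
    },
}


def extract(site, markup):
    """Apply the site's XPath template to well-formed markup."""
    root = ET.fromstring(markup)
    record = {}
    for field, xpath in TEMPLATES[site].items():
        node = root.find(xpath)
        # A missing node is the usual curveball: record None instead of crashing.
        if node is not None and node.text:
            record[field] = node.text.strip()
        else:
            record[field] = None
    return record


# Made-up sample page for the hypothetical "example-shop" site.
SAMPLE = """<html><body>
  <h1 class="product-title"> Widget </h1>
  <div><span class="price">19.99</span></div>
</body></html>"""

if __name__ == "__main__":
    print(extract("example-shop", SAMPLE))
```

Supporting a new site then means adding a template entry rather than new code, at least until the site does something the template can't express.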
 

KB

Diamond Member
Nov 8, 1999
5,406
388
126
Hah, and then they dramatically update their website and your client gets mad at you because the scraping stopped working.
 

Markbnj

trust issues suck.

You know, I shouldn't give the wrong impression, because this client is actually pretty easygoing. It's more a recognition that in a lot of areas you can describe the challenging parts to clients in terms they understand, but when it comes to scraping it's frustrating, because on the surface it all looks the same, and right under the surface it all goes to hell :). All of a sudden you're trying to explain how to trigger page navigation in an ASP.NET WebForms site using a simulated postback, or you're explaining that the nice number the client can see on the page is actually pulled from an AJAX call, and since you can't make calls to the endpoint you have to install PhantomJS and run the JavaScript in a headless browser. Of course... then you have to explain endpoint, JavaScript, and headless.
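To make the "simulated postback" part concrete, here is a rough sketch under stated assumptions, not the code used on this gig. WebForms pages navigate by submitting the page's form with its hidden state fields (such as `__VIEWSTATE`) plus `__EVENTTARGET` and `__EVENTARGUMENT` naming the control that "fired". The sketch collects the hidden inputs from a fetched page and builds the POST body. The sample markup, the field values, and the control name `ctl00$Main$Grid` are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(page_html, event_target, event_argument=""):
    """Build the form body that a __doPostBack(target, argument) click would submit."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)  # __VIEWSTATE and friends, echoed back unchanged
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return urlencode(payload).encode("ascii")


# Made-up page fragment; real view state values are much longer.
SAMPLE_PAGE = """
<form method="post" action="./Results.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgKz" />
  <a href="javascript:__doPostBack('ctl00$Main$Grid','Page$2')">2</a>
</form>
"""

if __name__ == "__main__":
    body = build_postback(SAMPLE_PAGE, "ctl00$Main$Grid", "Page$2")
    print(parse_qs(body.decode("ascii")))
```

The body would then be POSTed to the form's action URL with a form-encoded content type, and the hidden fields have to be re-read from every response, because the state values generally change from page to page.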
 

BoberFett

Lifer
Oct 9, 1999
37,562
9
81
Markbnj said:
Unfortunately this particular gig is uninvited scraping, so we have to work with what we have. The data can be had, but obviously every site is different. Right now I'm dealing with one where I have to generate an ASP postback in order to navigate pages. You try describing this stuff to the client when the page looks just like the page you easily scraped yesterday, and their eyes just glaze over.

I had to write something like that once upon a time. I even had the problem of a very limited result set, so I had to use increasingly detailed dynamic search parameters to narrow the results in sort of a B-tree-esque attempt to whittle down to fewer than 100 results, and scrape tens of thousands of records that way. I was quite happy with the end result, but it was ten times the work of other sites I wrote scrapers for. Fortunately my boss at the time wasn't a micromanager and was more concerned with results.
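The narrowing approach described above can be sketched roughly like this. It is a guess at the general shape, not the original code: the 100-row cap comes from the "fewer than 100 results" remark, and `search` is a hypothetical callable that returns whatever rows the site would show for a given prefix. Whenever a query comes back at the cap, it is split into longer prefixes and each one is searched in turn.

```python
from itertools import product

RESULT_CAP = 100  # assumed per-search row limit, taken from the post above


def crawl(search, prefix="", alphabet="abcdefghijklmnopqrstuvwxyz", max_depth=6):
    """Recursively narrow a prefix search until each query fits under the cap."""
    rows = search(prefix)
    found = set(rows)
    # Under the cap means nothing was cut off, so this branch is complete.
    # max_depth is a safety stop; past it we accept a possibly truncated set.
    if len(rows) < RESULT_CAP or len(prefix) >= max_depth:
        return found
    for ch in alphabet:
        found |= crawl(search, prefix + ch, alphabet, max_depth)
    return found


# Stand-in data set: every string of length 1 to 5 over a four-letter alphabet.
ALPHABET = "abcd"
RECORDS = sorted(
    "".join(p) for n in range(1, 6) for p in product(ALPHABET, repeat=n)
)


def fake_search(prefix):
    """Stand-in for the real site: sorted matches, truncated at the cap."""
    return [r for r in RECORDS if r.startswith(prefix)][:RESULT_CAP]


if __name__ == "__main__":
    scraped = crawl(fake_search, alphabet=ALPHABET)
    print(len(scraped), "of", len(RECORDS), "records recovered")
```

Keeping the rows from a capped query before splitting matters: a record that equals the prefix exactly would not match any of the longer prefixes, so it is only picked up if it appeared in the capped result (which it does here because the fake site sorts its results).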
 

Ken g6

Markbnj said:
you can't make calls to the endpoint

Why not? Can't you install MS Fiddler, use it to figure out the codes being sent and received via AJAX, and replicate them yourself? Or are they too complex? Or do they change format too frequently?

I have some experience with this in automated testing with HP LoadRunner. Last I checked (which was several years ago) the Virtual User Generator was free, and could occasionally be helpful in parsing data being sent back and forth as well.
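As a rough illustration of "replicate them yourself": once a traffic capture shows what the page sends, the same request can be rebuilt with the standard library. This is only a sketch. The URL, the JSON payload shape, and the header set below are placeholders; the real values would be copied from whatever the capture shows.

```python
import json
from urllib.request import Request, urlopen


def build_ajax_request(url, params):
    """Recreate a JSON POST as observed in a traffic capture (hypothetical shape)."""
    body = json.dumps(params).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Many XHR-based front ends send this header; copy what the capture shows.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def fetch_json(request, timeout=10):
    """Send the request and decode a JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder endpoint and payload; nothing is sent here, only built.
    req = build_ajax_request("https://example.com/api/prices", {"sku": "12345"})
    print(req.get_method(), req.full_url, req.data)
```

Whether this works depends on the endpoint: some need session cookies or tokens from an earlier page load, which is where the headless browser starts to look attractive.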
 

Markbnj

Ken g6 said:
Why not? Can't you install MS Fiddler, use it to figure out the codes being sent and received via AJAX, and replicate them yourself? Or are they too complex? Or do they change format too frequently?

I have some experience with this in automated testing with HP LoadRunner. Last I checked (which was several years ago) the Virtual User Generator was free, and could occasionally be helpful in parsing data being sent back and forth as well.

Yeah, we use Firebug, same deal. In many cases we can call the endpoint just fine, by setting the domain appropriately (although my boss is skittish that this is misrepresenting what we're doing). But the specific example I was thinking of is a single-page JSF app, and the endpoints return "faces," whatever the hell they are. I have to dig into it and see if we can get something out of them.
 

Ken g6

Markbnj said:
But the specific example I was thinking of is a single-page JSF app, and the endpoints return "faces," whatever the hell they are. I have to dig into it and see if we can get something out of them.

Hm, LMGTFY.

"The Java EE 7 Tutorial: Using Ajax with JavaServer Faces Technology"

Looks like this might fit the example of "too complex". :(
 

cytg111

Lifer
Mar 17, 2008
24,561
14,008
136
Hmm... make 'em run Wireshark and send you the dump? Or make 'em run through your proxy and you Wireshark it?
 

Train

Markbnj said:
You know, I shouldn't give the wrong impression, because this client is actually pretty easygoing. It's more a recognition that in a lot of areas you can describe the challenging parts to clients in terms they understand, but when it comes to scraping it's frustrating, because on the surface it all looks the same, and right under the surface it all goes to hell :). All of a sudden you're trying to explain how to trigger page navigation in an ASP.NET WebForms site using a simulated postback, or you're explaining that the nice number the client can see on the page is actually pulled from an AJAX call, and since you can't make calls to the endpoint you have to install PhantomJS and run the JavaScript in a headless browser. Of course... then you have to explain endpoint, JavaScript, and headless.

Well, in that case I hope there were enough of the opposite scenarios... where they expect something to be hard, and you pull a perfect solution out of your back pocket like a fuckin wizard... for it to even out in the long run.
 

Rakehellion

Lifer
Jan 15, 2013
12,181
35
91
Markbnj said:
It's almost impossible to describe to a client any of what makes data on a given page easier or harder to get. They can see the data, and everything beyond that is spoken in Klingon.

They just tell me what page needs to be done and I tell them a price. They don't need to know details.
 

Cogman

Lifer
Sep 19, 2000
10,283
134
106
Rakehellion said:
They just tell me what page needs to be done and I tell them a price. They don't need to know details.

Depends on whether they question you about the price, especially if one page costs $1,000 and the other costs $10,000.

Or worse, one page takes 10 days to build a scraper for and another takes 30. Many clients think that every page is just as complex to scrape as the last.
 

DaveSimmons

Elite Member
Aug 12, 2001
40,730
670
126
I feel your pain. We use scraping to integrate with some server platforms, REST for others, SOAP for others.

The scraping code requires much more ongoing maintenance. In some cases the server platform also has site-specific customizations like integration with a Single-Sign-On system such as CAS, Shibboleth or something home-grown.

Explaining to a customer that our software can't magically figure out how to work through their custom authentication sequence without writing new code is fun.
 

Rakehellion

Cogman said:
Depends on whether they question you about the price, especially if one page costs $1,000 and the other costs $10,000.

In that case, I just tell them "it's more complex."

I don't burden idiots with the truth. If I have to explain something to a person who won't understand and doesn't need to know anyway, I just bombard them with technical jargon until they're sated.

Actually, the same technique works with picking up women.
 

Leros

Lifer
Jul 11, 2004
21,867
7
81
The really annoying thing about scraping is when the client doesn't understand that changing the structure of the site will break scraping. "All the information is mostly in the same place, it's not my fault the scraper doesn't work anymore". Gah.