Wget sanitizing my XML?

etherealfocus

Senior member
Jun 2, 2009
488
13
81
I'm using Wget to download a 110MB XML product sheet from a vendor's server on a regular basis for inventory and other updates, but it keeps sanitizing the output file. Not a huge deal, but right now it's taking me ~15 mins every day to find/replace &lt; and &gt; back to < and >. If I call it from the browser it comes through unsanitized, but there's so much text it crashes Chrome when I try to copy-paste... and getting it through the browser isn't very amenable to the automation I'm working on anyway.

Is there a command to keep Wget from sanitizing the output file? And while I'm at it, I'm sure there's a command to specify the result file name and destination but I'm not finding it in the documentation.

The command I'm using is:

wget "url&parameters" HTTP/1.1

also tried wget -o, same result. I contacted the vendor about it and they said their files are all unsanitized and it's gotta be something on my end, but didn't get much more specific.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
Wget doesn't sanitize, the server you are hitting is doing that.

Try adding

--header "Accept: application/xml"

And see what happens. My bet is that the guy who wrote your endpoint also wrote an HTML version so he could hit the endpoint and see what is going on. (You might need to play with the type a bit; it might be text/xml, text/plain, or some other weird concoction. However, it shouldn't have html in the text.... shouldn't....)

If that fails, then you can do something like this

sed -i 's/&lt;/</g' myfile.xml
sed -i 's/&gt;/>/g' myfile.xml
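For example, here are those two in-place substitutions run over a one-line sample file (the file name and contents are just stand-ins for your real feed), so you can see the entities turn back into brackets before pointing it at the 110MB download:

```shell
# Create a small sample file with escaped entities (stand-in for the real feed)
printf '&lt;product id="1"&gt;\n' > myfile.xml

# Apply the two in-place substitutions from above (GNU sed's -i flag)
sed -i 's/&lt;/</g' myfile.xml
sed -i 's/&gt;/>/g' myfile.xml

cat myfile.xml   # <product id="1">
```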
 

KenJackson

Junior Member
Jun 16, 2013
14
0
0
XML and HTML require the '<' and '>' characters to be replaced with the entities you mentioned everywhere they're not actually part of the markup. Your browser converts them back automatically when it displays the file.

But you can convert them back yourself. For example:
Code:
sed 's/&lt;/</g; s/&gt;/>/g' input-file > output-file
Also, the lynx browser can be used to both download the file and convert it:
Code:
lynx -dump "http://url&parameters" > output.file
 

etherealfocus

Thanks Ken and Cog, but I'm a little confused by the sed syntax. I'm definitely a wget noob, and I haven't been able to find documentation for the sed command since the string 'sed' is so common it shows up a million times when I search http://www.gnu.org/software/wget/manual/wget.html for it. I'm experimenting with the --header and -O options (the latter to specify filename and location) and will post back with results.

Does this look right?

wget -O=c:\a.xml "url&parameters" HTTP/1.1
sed 's/&lt;/</g; s/&gt;/>/g' a.xml > a.xml

or wget -O=a.xml sed 's/&lt;/</g; s/&gt;/>/g' a.xml > a.xml "url&parameters" HTTP/1.1

or something else?

Sorry for the noob questions :)
 

KenJackson

When you wrote 'wget "url&parameters" HTTP/1.1', I assumed you were substituting "url" for an actual website URL that you didn't want to name. I repeated it with that thinking. And I didn't check the syntax of wget, but I don't think you can specify HTTP/1.1 that way.

You probably want something like this, substituting the real URL as appropriate, which may include "?something" and multiple "&somethings":
Code:
wget -O- "http://example.com/" | sed 's/&lt;/</g; s/&gt;/>/g' > a.xml
-O- (or with a space after 'O') means write the output to stdout,
| means pipe stdout of wget to stdin of the next program (sed),
s/../../ is a substitute command that makes sed replace the first text with the second,
g is a flag that makes the s command global so it changes all instances on a line,
> writes stdout from sed to the filename that follows.
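You can try the same pipeline offline by feeding sed a sample line in place of the wget output (no network needed; the file name is hypothetical):

```shell
# Stand-in for wget -O-: print an escaped sample line to stdout,
# pipe it through the same sed substitutions, and redirect to a file
printf '&lt;a&gt;1&lt;/a&gt;\n' | sed 's/&lt;/</g; s/&gt;/>/g' > a.xml

cat a.xml   # <a>1</a>
```

Once that works, swapping the printf for the real wget command is the only change.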

BTW, there's a checkbox named "Disable smilies in text" below the message box that will prevent commands from being changed to smilies.
 

etherealfocus

Thanks for the explanation! I was indeed subbing the url&parameters for the actual url and parameters, which I'd rather not specify since it's sensitive work stuff.

Adding the -O- argument works fine (oddly causes it to output the download text to the screen rather than simply showing a progress bar but whatever), but appending | sed 's/&lt;/</g; s/&gt;/>/g' > a.xml causes it to terminate immediately. sed without the > a.xml and > a.xml without sed both cause immediate termination as well. Any ideas?

Here's a copy-paste from my .bat file with only the url and params removed:

wget -O- "url" | sed 's/&lt;/</g; s/&gt;/>/g' > a.xml
 

KenJackson

Your .bat file? Oh no. Since I work exclusively with bash shell scripts (both Linux and Windows/Cygwin) I didn't think about a batch file.

I can't remember for sure, but I think the CMD shell recognizes double quotes, but not single quotes. So you could try replacing the single quotes with double quotes and see if it works.
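The sed expression itself has nothing in it that double quotes break, so only the quoting style needs to change for CMD. A quick sanity check on sample input (not your real feed):

```shell
# The sed script behaves the same whether quoted with ' or " here,
# so the double-quoted form should be safe to drop into a .bat file
printf '&lt;x&gt;\n' | sed "s/&lt;/</g; s/&gt;/>/g"   # prints <x>
```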
 

etherealfocus

Nope, same results. It fails even without the sed command... apparently even > a.xml is no bueno. Ideas? :/