
Command line parsing of RSS feeds

somethingsketchy

Hello all-

I'm currently working on a personal "project" where I...

1. take an RSS feed URL (http://www.tntradioempire.com/rss/?type=podcasts&format=rss&path=/AUDIO/podcast for example)
2. download the entire page (with links) via wget: $ wget "http://url.link.here..." -O log.txt
3. parse the log with: grep -w http log.txt > download.txt (download.txt is a second file)


As of right now I have many lines (in the second file) where each line is:
<enclosure url="http://media.journalinteractive.com/audio/07010930.mp3" length="10323342" type="audio/mpeg"/>

(NOTE: the ########.mp3 is the timestamped audio file according to when the particular audio clip was played on the air - as it is a radio station out in the Mid-West).


What I'm trying to do is further parse the second file (download.txt) to get the exact web address of the mp3 file. So far I've tried some basic regular expressions (I think I tried s/^http/ ) to further extract the http link; however, I just get the exact same lines of text back.

What grep, sed, awk, or any other Linux command could I try to extract the http link out of the file and then output the http link to another file?

Thank you in advance for your suggestions and have a great 4th of July!

EDIT: I forgot a specific detail that might further assist the reader.
 
You've got a Linux environment, so pick any scripting language you feel comfortable with and use its built-in parsing libraries to do all the heavy lifting. Trying to parse XML/HTML with regular expressions is an exercise in frustration, in my opinion. Why do all the hard work when most languages' libraries have all the tools needed already written and tested?
 
What documentation would you recommend to read up on a parsing library?

I think I've seen some links for parsing libraries, but those were for <insert programming language>. I figured I could just use command-line tools to simplify the process, though I'd be up for writing a small script/program to do the parsing, if need be.
 
That's my point, pick a language that you want to write it in. If you're looking for a recommendation for that I would use Ruby, it has an excellent RSS parsing library.
 

Exactly: any language you want should be easily installable, so choose whatever you're comfortable with. I'd probably go with Perl, since I know it pretty well already and CPAN has modules for everything, and after that Python, simply because it's so popular these days.
 
I looked up the XML::Parser module, but it didn't seem to have what I was looking for. I haven't looked up the HTML version of the parser module, but I'm uncertain whether the module will strip the http web address out of the line of text that I mentioned above:

<enclosure url="http://media.journalinteractive.com/audio/07010930.mp3" length="10323342" type="audio/mpeg"/>

If the web address were surrounded by <url></url> tags, then I think I could make slight adjustments to the code. I could use substr() to at least grab the url="http://...mpeg"/ part of the string, but I'm not sure how I could further extract the link...

Perhaps a small regex to search for the first quoted phrase in the string?
 

In Perl,

Code:
# attempt to match the form url="http://foo.bar/dweezle.mp3" with optional quotes
m/url="?([^" ]+)/i;
# whatever is captured by the first set of parens goes into special variable $1
$theurl = $1;

This will grab the first URL in $_. The regex is a bit imprecise, for the sake of brevity.

The reason you don't get the results you want from grep is that grep returns the entire line when it locates a match.
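As an aside, if your grep is GNU grep, the -o flag makes it print only the matched portion of the line instead of the whole line, which gets you most of the way there without Perl. A quick sketch using the sample line from earlier in the thread (printf stands in for reading the file):

```shell
# Sample enclosure line from the feed (copied from earlier in the thread)
line='<enclosure url="http://media.journalinteractive.com/audio/07010930.mp3" length="10323342" type="audio/mpeg"/>'

# -o prints only the matched text; -E enables extended regexes;
# match "http" followed by everything up to (not including) the closing quote
printf '%s\n' "$line" | grep -oE 'http[^"]*'
```

Against your real download.txt you'd run grep -oE 'http[^"]*' download.txt instead of the printf.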

Given that you are trying to harvest multiple URLs, you might prefer something like:

Code:
# match in list context returns $1, $2, etc.
# global (g) flag says find all possible matches
@arrayofurls = m/url="?([^" ]+)/gi;

Or if you are certain you will only want MP3s:

Code:
@arrayofurls = m/url="?([^" ]+\.mp3)/gi;

There are plenty of reasons to use something more RSS-specific, like maybe XML::RSS::Parser::Lite, not least that relying on just regexes can fall apart if the RSS creator changes their behavior at all. But the above should be a decent quick-and-dirty solution.
 
Thanks for your help, Aluvus. I'll play around with what you've supplied and see what I can make of it.

Hopefully I'll have some time today to write up a script and try it out.
 
Well I've partially figured out my own solution, using a few awk commands....

1. wget <insert rss url link here> -O log.txt
2. grep -w http log.txt > download.txt
3. awk '/\.mp3/' download.txt > grab.txt # I think I need to come up with more original filenames
4. awk '{ print $2 }' grab.txt > capture.txt

Of course now the only question is how to further extract the web address from the following line (example line):

url="http://media.journalinteractive.com/audio/00000000.mp3"

Would a split() or substr() function take care of the extraction that I would need? Or would I need to do something else?
 
That looks close. The first thing you need to learn about is pipes. A pipe takes one command's STDOUT and feeds it to the next command's STDIN. Converting your code, that makes:

wget <insert rss url link here> -O - | grep -w http | awk '/\.mp3/' | awk '{ print $2 }' > capture.txt
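If pipes are unfamiliar, here's the same chain run on a made-up sample line (the example.com URL is invented for the demo), with printf standing in for wget so you can try it without touching the network:

```shell
# printf plays the role of wget writing to stdout;
# grep keeps lines containing the word "http";
# the first awk keeps lines mentioning .mp3;
# the second awk prints the second whitespace-separated field
printf '<enclosure url="http://example.com/audio/x.mp3" length="123" type="audio/mpeg"/>\n' \
  | grep -w http | awk '/\.mp3/' | awk '{ print $2 }'
```

Note the output is still url="http://example.com/audio/x.mp3", which is exactly why the URL needs one more extraction step.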

I think that's right. But I never use awk. For things awk is good at I use Perl. But what you want is sed. Here's how I'd do it:

wget <insert rss url link here> -O - | grep 'url="http' | sed -e 's/^[^"]*url="//;s/".*$//' > capture.txt

If you don't like sed, the other thing that has a chance of working is:

wget <insert rss url link here> -O - | grep 'url="http' | cut -d\" -f2 > capture.txt
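Both variants can be tested on a sample line before pointing them at a live feed (again, the example.com URL is made up for the demo):

```shell
sample='<enclosure url="http://example.com/audio/x.mp3" length="123" type="audio/mpeg"/>'

# sed: first delete everything up to and including url=",
# then delete everything from the next quote to end of line
printf '%s\n' "$sample" | sed -e 's/^[^"]*url="//;s/".*$//'

# cut: split the line on double quotes and take the second field
printf '%s\n' "$sample" | cut -d\" -f2
```

Both print the bare URL, http://example.com/audio/x.mp3. The cut version is simpler but relies on url=" being the first quoted attribute on the line; the sed version finds url=" wherever it appears.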
 
I had (initially) looked up sed, but for whatever reason I couldn't seem to find what I was looking for. Then the other day (when I had about 6 hours to kill), I looked up the man page for awk and started to pore over the contents. It was in the man page where I got the line...

awk '/regex/' filename

...minus the "> filename2" that I added in there (since I wasn't completely clear on pipes and outputting with pipes). I'll have to try this out when I have more time after work today.

Thanks for your assistance, Ken g6.
 
Actually nvm, I just tested this out in a VM and holy crap, it worked! This is what I used, per your suggestion...

wget <insert rss url link here> -O - | grep 'url="http' | cut -d\" -f2 > capture.txt

Thanks, Ken g6!
 