
Command line parsing of RSS feeds

somethingsketchy

Hello all-

I'm currently working on a personal "project" where I...

1. take an RSS feed URL (http://www.tntradioempire.com/rss/?type=podcasts&format=rss&path=/AUDIO/podcast for example)
2. download the entire page (with links) via wget: $ wget "http://url.link.here..." -O log.txt
3. parse the log with: grep -w http log.txt > download.txt (download.txt is a second file)


As of right now I have many lines (in the second file) where each line is:
<enclosure url="http://media.journalinteractive.com/audio/07010930.mp3" length="10323342" type="audio/mpeg"/>

(NOTE: the ########.mp3 is the timestamped audio file according to when the particular audio clip was played on the air - as it is a radio station out in the Mid-West).


What I'm trying to do is further parse the second file (download.txt) to get the exact web address of the mp3 file. So far I've tried some basic regular expressions (I think I tried s/^http/ ) to further extract the http link; however, I just get the exact same lines of text back.

What grep, sed, awk, or any other Linux command could I try to extract the http link out of the file and then output the http link to another file?

Thank you in advance for your suggestions and have a great 4th of July!

EDIT: I forgot a specific detail that might further assist the reader.
 
You've got a Linux environment, so pick any scripting language you feel comfortable with and use its built-in parsing libraries to do all the heavy lifting. Trying to parse XML/HTML with regular expressions is an exercise in frustration, in my opinion. Why do all the hard work when most languages' libraries have all the tools needed already written and tested?
 
What documentation would you recommend to read up on a parsing library?

I think I've seen some links for parsing libraries, but those were for <insert programming language>. I figured I could just use command-line tools to simplify the process, though I'd be up for writing a small script/program to do the parsing, if need be.
 
That's my point, pick a language that you want to write it in. If you're looking for a recommendation for that I would use Ruby, it has an excellent RSS parsing library.
 

Exactly: any language you want should be easily installable, so choose whatever you're comfortable with. I'd probably go with Perl, since I know it pretty well already and CPAN has modules for everything, and after that Python, simply because it's so popular these days.
 
I looked up the XML::Parser module, but it didn't seem to have what I was looking for. I haven't looked up the HTML version of the parser module, but I'm uncertain whether the module will strip the http web address out of the line of text that I mentioned above:

<enclosure url="http://media.journalinteractive.com/audio/07010930.mp3" length="10323342" type="audio/mpeg"/>

If the web address were surrounded by <url></url> tags, then I think I could make slight adjustments to the code. I could use substr() to at least grab the url="http://...mpeg"/ part of the string, but I'm not sure how I could further extract the link...

Perhaps a small regex to search for the first quoted phrase in the string?
 

In Perl,

Code:
# attempt to match the form url="http://foo.bar/dweezle.mp3" with optional quotes
m/url="?([^" ]+)/i;
# whatever is captured by the first set of parens goes into special variable $1
$theurl = $1;

This will grab the first URL in $_. The regex is a bit imprecise, for the sake of brevity.

The reason you don't get the results you want from grep is that grep returns the entire line when it locates a match.
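As an aside, if your grep is GNU grep, the -o flag makes it print only the matched portion of the line instead of the whole line, which gets you most of the way there without Perl. A quick sketch using the sample line from earlier in the thread (printf stands in for reading the file):

```shell
# Sample enclosure line from the feed (copied from earlier in the thread)
line='<enclosure url="http://media.journalinteractive.com/audio/07010930.mp3" length="10323342" type="audio/mpeg"/>'

# -o prints only the matched text; -E enables extended regexes;
# match "http" followed by everything up to (not including) the closing quote
printf '%s\n' "$line" | grep -oE 'http[^"]*'
```

Against your real download.txt you'd run grep -oE 'http[^"]*' download.txt instead of the printf.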

Given that you are trying to harvest multiple URLs, you might prefer something like:

Code:
# match in list context returns $1, $2, etc.
# global (g) flag says find all possible matches
@arrayofurls = m/url="?([^" ]+)/gi;

Or if you are certain you will only want MP3s:

Code:
@arrayofurls = m/url="?([^" ]+\.mp3)/gi;

There are plenty of reasons to use something more RSS-specific, like maybe XML::RSS::Parser::Lite, not least that relying on just regexes can fall apart if the RSS creator changes their behavior at all. But the above should be a decent quick-and-dirty solution.
 
Thanks for your help, Aluvus. I'll play around with what you've supplied and see what I can make of it.

Hopefully I'll have some time today to write up a script and try it out.
 
Well I've partially figured out my own solution, using a few awk commands....

1. wget <insert rss url link here> -O log.txt
2. grep -w http log.txt > download.txt
3. awk '/\.mp3/' download.txt > grab.txt # I think I need to come up with more original filenames
4. awk '{ print $2 }' grab.txt > capture.txt

Of course now the only question is how to further extract the web address from the following line (example line):

url="http://media.journalinteractive.com/audio/00000000.mp3"

Would a split() or substr() function take care of the extraction that I would need? Or would I need to do something else?
 
That looks close. The first thing you need to learn about is pipes. A pipe takes one command's STDOUT and feeds it to the next command's STDIN. Converting your code, that makes:

wget <insert rss url link here> -O - | grep -w http | awk '/\.mp3/' | awk '{ print $2 }' > capture.txt
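If pipes are unfamiliar, here's the same chain run on a made-up sample line (the example.com URL is invented for the demo), with printf standing in for wget so you can try it without touching the network:

```shell
# printf plays the role of wget writing to stdout;
# grep keeps lines containing the word "http";
# the first awk keeps lines mentioning .mp3;
# the second awk prints the second whitespace-separated field
printf '<enclosure url="http://example.com/audio/x.mp3" length="123" type="audio/mpeg"/>\n' \
  | grep -w http | awk '/\.mp3/' | awk '{ print $2 }'
```

Note the output is still url="http://example.com/audio/x.mp3", which is exactly why the URL needs one more extraction step.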

I think that's right. But I never use awk. For things awk is good at I use Perl. But what you want is sed. Here's how I'd do it:

wget <insert rss url link here> -O - | grep 'url="http' | sed -e 's/^[^"]*url="//;s/".*$//' > capture.txt

If you don't like sed, the other thing that has a chance of working is:

wget <insert rss url link here> -O - | grep 'url="http' | cut -d\" -f2 > capture.txt
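Both variants can be tested on a sample line before pointing them at a live feed (again, the example.com URL is made up for the demo):

```shell
sample='<enclosure url="http://example.com/audio/x.mp3" length="123" type="audio/mpeg"/>'

# sed: first delete everything up to and including url=",
# then delete everything from the next quote to end of line
printf '%s\n' "$sample" | sed -e 's/^[^"]*url="//;s/".*$//'

# cut: split the line on double quotes and take the second field
printf '%s\n' "$sample" | cut -d\" -f2
```

Both print the bare URL, http://example.com/audio/x.mp3. The cut version is simpler but relies on url=" being the first quoted attribute on the line; the sed version finds url=" wherever it appears.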
 
I had (initially) looked up sed, but for whatever reason I couldn't seem to find what I was looking for. Then the other day (when I had about 6 hours to kill), I looked up the man page for awk and started to pore over the contents. It was in the man page where I got the line...

awk '/regex/' filename

...minus the "> filename2" that I added in there (since I wasn't completely clear on pipes and outputting with pipes). I'll have to try this out when I have more time after work today.

Thanks for your assistance, Ken g6.
 
Actually nvm, I just tested this out in a VM and holy crap, it worked! This is what I used, per your suggestion...

wget <insert rss url link here> -O - | grep 'url="http' | cut -d\" -f2 > capture.txt

Thanks, Ken g6!
 