
awk, perl help?

AtlantaBob

Hi,

I've got some awk scripts that I developed to extract data from tab-delimited files. I'm going to have a lot of data files to manipulate (academic research), so I'd like to have this as automated as possible. To do that, I'm going to need to pass data from one run of awk to another (I think this is easier than putting it all in one big giant script).

So, if I run my first command, ./step inputfile, and get "4" as output, how can I pass "4" on to the next script to use?

Will I need to use perl for this? I'm still very new to all of this, so many thanks!

(All of this is using Cygwin on my laptop, but I could move over to a dedicated Linux box if necessary.)
 
One way to do it would be to tie the steps loosely together using a shell script that uses backtick operators to capture output. Something like...

#!/bin/sh

OUT1=`awk blahblahblah`
OUT2=`awk "blahblah${OUT1}blahblah"`    # braces keep $OUT1 distinct from the surrounding text

and so on. There may be issues with quoting or other things I'm not seeing, though - I haven't used awk a whole lot.
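To make that concrete (the data and inline programs here are all made up for illustration), you can capture the first step's output and hand it to the next awk run with the -v flag, which sidesteps most of the quoting headaches:

```shell
#!/bin/sh
# Contrived example: the inline awk programs stand in for your real scripts.

# First step: pretend this computes the step factor from the data.
STEP=`printf '0 0\n4 8\n' | awk 'NR==2 { print $1; exit }'`

# Second step: pass the captured value in with -v instead of splicing
# it into the program text, so no tricky double-quoting is needed.
printf '8 16\n' | awk -v step="$STEP" '{ print $1 / step }'
```

Running it prints 2 (8 divided by the captured step factor of 4).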

In the long run, I think you're much better off using perl. Perl is built for text manipulation and it's much, much easier to handle multiple files or procedures with perl than it is to string together a bunch of UNIX utilities. Perl is kind of weird to learn because it's really idiosyncratic, but if you spend much time doing complicated text parsing it's going to save you a ton of time in the long run. If you want, post an example of what you're doing and we can probably put something simple together as a starting point.
 
Thanks, cleverhandle. I certainly appreciate it. I think I'm in a fairly awkward situation--need to get something put together in the next couple of days so I can start processing data, but I would like to learn perl... oh well.

Here's what I'm trying to do, along with an example of the data set I've got. Only variables with an asterisk are important to me. Essentially, I want to get the data from this program into a simple two-dimensional array. There are two problems I'm facing: 1) the x,y coordinates do not increase by 1, but rather by a stepping factor (4 in this case, though it will change for each data set); 2) if the value of var1, the data I care about here, is 0, then the original program does not create a line entry for it. (In this case, you'll note that there's an 8-unit gap between the y values of the second and third lines.)




I already have an awk script that strips off the first line, then computes and outputs the step value (the x,y multipliers). In the second part of the program, I'd like to determine the maximum and minimum values of both x and y, then divide them by the stepping value, so I can get the upper and lower bounds for an array (I'm planning on having a 0-based array and shifting the minimum x and y values to 0). This is the part I was thinking I might want to use perl for. After I plug the defined data points into the array, I envision a second pass over the array, assigning the "0" value to any previously undefined entry (if that's necessary in the language).
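The min-and-max part can at least be sketched in awk already; assuming whitespace-separated "x y var1" lines with the header stripped and the step factor already known (the numbers below are invented), it would look something like this:

```shell
#!/bin/sh
# Sketch: compute 0-based array bounds as (max - min) / step for each axis.
# Assumes "x y var1" lines, header already removed; sample data is invented.
STEP=4
BOUNDS=`printf '0 8 1.5\n4 16 2.0\n8 32 0.7\n' | awk -v step="$STEP" '
NR == 1 { xmin = xmax = $1; ymin = ymax = $2 }
{
    if ($1 < xmin) xmin = $1
    if ($1 > xmax) xmax = $1
    if ($2 < ymin) ymin = $2
    if ($2 > ymax) ymax = $2
}
END {
    # After shifting the minimums to 0, these are the upper indexes.
    print (xmax - xmin) / step, (ymax - ymin) / step
}'`
echo "$BOUNDS"
```

With that invented data it prints "2 6", the upper x and y indexes for the array.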

At some point, I'll also need to do some other work with this data--things like summing and averaging the data for an entire row of x, or an entire row of y.

Any suggestions you might have as to how I would do this would be great!

Finally, here's a slightly abridged version of how I see the program working.

data --> step factor --> max and min values (divided by step factor) --> define array --> copy data to array
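In shell terms, I picture gluing that flow together roughly like this (the function is just a stand-in for my real ./step script, and the later stages don't exist yet):

```shell
#!/bin/sh
# Driver sketch for: data -> step factor -> bounds -> array fill.
# The function below stands in for the real ./step awk script, which
# prints the step factor for a given input file.
step() { echo 4; }

INPUT=datafile.txt           # placeholder filename
STEP=`step "$INPUT"`
echo "step factor: $STEP"

# Later stages (placeholders, not written yet):
# BOUNDS=`./bounds "$INPUT" "$STEP"`
# ./fill_array "$INPUT" "$STEP" $BOUNDS
```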
 
I'm not certain I followed all the details. Some clarifying questions...
Originally posted by: AtlantaBob
2.) If the value of var1 -- the data that I care about here -- is 0, then the original program does not create a line entry for it.
So you want to get these zeros back into the data before you do whatever analysis you're doing?
What I'd like to do in the second part of the program is to determine the maximum and minimum values of both x and y, then divide them by the stepping value, so I could get the upper and lower bounds for an array (I'm planning on having a 0-based array and shifting the min x and y values to 0).
From this, I'm reading that you want to "scale" the indexes of the array so that they start at (0,0) and use a step of 1? In other words, applying a linear transformation...
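In other words, index = (x - xmin) / xstep on each axis. A quick one-line illustration with invented values (xmin = 8, xstep = 4):

```shell
#!/bin/sh
# Map a raw coordinate onto a 0-based, step-1 array index:
# index = (x - xmin) / xstep. The values here are invented.
IDX=`echo 20 | awk -v xmin=8 -v xstep=4 '{ print ($1 - xmin) / xstep }'`
echo "$IDX"
```

So the raw coordinate 20 lands at index 3.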

What you're describing is more what I would call "number crunching" than "text processing". Perl can number crunch, but it's not really as flexible a tool for doing that as a spreadsheet like Excel or OOCalc. Unless I had to do a bunch of these, I would use a spreadsheet for the transformations and data exploration. The spot I maybe see perl being useful is reinserting the zero points into the data - that would be clunky in a spreadsheet, though I'm sure it must be possible. I think you could pretty easily construct a perl script that took the mins, maxs, and steps as arguments and worked through the data reinserting lines for the dropped zeros. Then you pick up the output file in a spreadsheet and take it from there.

Is that reasonable? It seems like the path of least resistance to me, even though you end up using multiple tools. At least it should be if you're not working on GBs of data or having to run dozens of these every day.
 
cleverhandle.

I think that you're following me fairly well. I do indeed want to get the zeros back into the data before running an analysis.

Also, you're quite right re: the x and y values and mapping them to an array.

One problem regarding the spreadsheet route: the large size of the data set can overwhelm Excel (at least in the 2000 version, and I think 2003 as well); the data set has considerably more than the 65K rows a spreadsheet can hold. Also, I do, unfortunately, have to do a lot of these. I'm not terribly concerned about execution speed, I can certainly have a dedicated machine run some analyses at night, but taking the time to import them into Excel, etc. is a real pain. I've written a few Visual Basic programs before that do rather similar things, and the math here is really more arithmetic manipulation than anything else... I just don't particularly want to have to go to C and become a real programmer to get this done 🙂
 
Originally posted by: AtlantaBob
One problem regarding the spreadsheet route: the large size of the data set can overwhelm Excel (at least in the 2000 version, and I think 2003 as well); the data set has considerably more than the 65K rows a spreadsheet can hold. Also, I do, unfortunately, have to do a lot of these.
OK, that would kind of suck then. Do you have any other tools available for the analysis? Something like SAS, SPSS, or Minitab?


 
That's an interesting idea... I'll have to think about that. (Mostly, I'll have to come up with a reason my advisor should fund my own research, and not his own). The computers on campus that have these and can run them are either old, crowded, or just otherwise annoying.

I haven't used many of the advanced features of SPSS--do you know if I could set up a script to, say, output histograms for 20 different input files?
 
I would certainly think so, but I don't have personal experience there. According to my wife, you can definitely do it in SAS, and probably in SPSS. But she says you'll be a bigger rockstar if you can do it in SAS. 🙂

If that route is available for analysis, then you just need to reinsert the zeros into the data and maybe do the coordinate scaling. You should be able to do that part nicely in perl, following something like this really rough pseudo-code...

(global scalars xcur, xprev, ycur, yprev, xstep, ystep, xmin, xmax, ymin, ymax)
(global arrays xarray, yarray, xsteparray, ysteparray)

1st pass: find the mins, maxs, and steps

open FILE
discard first line
while (<FILE>)
    chomp
    parse line to get xcur and ycur with regexes
    if xcur > xprev, then push xcur onto xarray and xcur - xprev onto xsteparray
    same for the y's
close FILE
xmin = minimum(xarray)
similar for the other mins, maxes, and steps

2nd pass: reinsert zeros

open INPUT file
open OUTPUT file
discard first line of INPUT
while (<INPUT>)
    chomp
    (some logic to check whether the line follows the step found above)
    if yes, then write $_ to OUTPUT
    if not, then construct a zero line and write that to OUTPUT, then repeat the check
close files

3rd pass: scale coordinates, if desired

(If you can make the first two parts work, you'll have no problem with this.)


That's just how I would structure it and I haven't given it a ton of thought. Also note that I'm not a programmer, I just play one on TV. 🙂 But it should be pretty workable, and still likely to be easier to do in perl than the alternatives (though it will probably be slow).

Sounds like you've got a busy couple of days ahead of you... 🙂

edit: stupid smileys in code
 
Thanks very much! I really, really appreciate your help. Also, thanks to your wife for her statistics software knowledge. Maybe it's time that I take up SAS.

And as far as a busy couple of days... yeup 🙂 Grad school... it's what you do when you're afraid that you might have too much free time in your life!
 