I ran many more tests, and with caching and RAID on the receiving side I was able to push performance up to around 80 MB/s (a peak of 78.1 MB/s on a set of files totaling 1.25 GB, and 83.4 MB/s on a single 568 MB file). Without significant caching (*), I get 65-70 MB/s on a single 4.5 GB file going from RAID to RAID.
(* In this case, at most 1 GB could be cached on the source side, because that's all the RAM it has; further, the caching would have to be smart enough not to roll over completely with every transfer of the 4.5 GB file. I saw some indication of such smarts, but given the sizes and rate limits involved it wouldn't make a significant difference overall.)
I noticed that pushing seems to be significantly faster than pulling (i.e. when transferring files from machine A to B, it's faster to initiate the transfer by logging on to A than by logging on to B), so from then on I focused on the faster mode rather than averaging the two or mixing them up. My guess is that the file system / OS applies some further optimization when a file is read locally that it doesn't apply when the read requests come in over the network. Of course, it may simply be some other unknown inefficiency; it's just a guess.
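For what it's worth, here's a rough sketch (in Python) of how one could time a single direction; the paths and share name are made-up stand-ins, and the file should be bigger than RAM so the cache doesn't flatter the result (hence the 4.5 GB file above). Run it on A with a remote target to get the "push" number; swap the paths and run it on B to get "pull".

    # push_pull_test.py -- time one large-file copy and report MB/s.
    # Run on A copying to a share on B for "push"; swap SRC and DST
    # and run on B for "pull". Paths below are hypothetical.
    import os, shutil, time

    SRC = r"D:\test\bigfile.bin"             # local 4.5 GB test file
    DST = r"\\machineB\share\bigfile.bin"    # remote target share

    size = os.path.getsize(SRC)
    start = time.time()
    shutil.copyfile(SRC, DST)
    elapsed = time.time() - start
    print("%.1f MB/s" % (size / elapsed / 1e6))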
On a 4.5 GB file, I saw on the order of 20% CPU on the sending side (2.8 GHz Pentium) and 55% CPU on the receiving side (2.0 GHz Athlon 64). These figures are just eyeball estimates; CPU utilization was somewhat erratic, swinging 5-10% around the given figure. Network utilization was around 52%. I think some of the difference in utilization is due to the processor differences; differences in the storage and network implementations on the two sides might also play a part.
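As a sanity check (my arithmetic, not a measurement): gigabit line rate is 10^9 bits/s, or 125 MB/s, and 52% of that is about 65 MB/s, which squares nicely with the 65-70 MB/s RAID-to-RAID figure above.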
In order to factor out the storage, I did some tests with a pure network transfer utility. I chose a version of the "open source" TTCP. AnandTech and others reference the Microsoft version, NTTTCP, which, in keeping with long-standing benchmarking tradition, seems to no longer be available. I used PCAUSA's PCATTCP 2.01.01.08, with settings as close to AnandTech's as I could get: -l 250000 -n 30000. This version comes with source.
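For concreteness, following the standard ttcp conventions (-r to receive, -t to transmit; I'd check the bundled docs for the exact options in any given version), the invocations look roughly like this:

    (receiver)  PCATTCP -r -l 250000 -n 30000
    (sender)    PCATTCP -t -l 250000 -n 30000 <receiver-ip>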
With this utility, I was able to get transfers up to around 110 MB/s, with 90-95% network utilization. CPU utilization was high, around 50% on both sides. Consecutive runs differed by around 5%.
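To put those numbers in perspective: -l 250000 -n 30000 moves 250,000 x 30,000 bytes = 7.5 GB per run, so at ~110 MB/s a run takes a bit over a minute. And 110 MB/s of payload is 880 Mbit/s, or 88% of gigabit line rate, which is consistent with the reported 90-95% utilization once protocol overhead on the wire is counted in.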
Well, I'm not sure how much this data helps you. My interpretation: I can drive the network pretty hard when there are no other bottlenecks; I may be running into CPU bottlenecks from application processing during real transfers (jumbo frames, if I could use them, might help there); with fast drives I currently cap out somewhere around 70 MB/s effective throughput; and, like everyone else, I benefit further from lots of RAM for file caching.