
the reason for my absolute hatred of mechanical HDDs

exdeath

Lifer
hddsucks.jpg


Do this a few times a day, or even a week. Not hard to see how this could completely ruin your day and destroy your work schedule and absolutely kill your productivity.

Shortly after grabbing that, it shot up to 5 hours remaining at 1.5 MB/sec.

1.5 MB/sec? In the year 2011? Seriously?

Hell, it took it over 10 minutes just to crawl the directory tree and estimate before it even STARTED.

Mechanical/magnetic data storage is 60 years obsolete. It was obsolete when CPUs exceeded 10 MHz. DIAF already kthx.

High density STT-MRAM cannot come soon enough. SRAM speeds for my primary data storage please.

PS: I've personally gone all solid state at home, even having forsaken and destroyed all optical media in my home in favor of USB 3 flash drives. I'm even willing to shell out of pocket for SSDs on MY work-provided PC. Does me little good working in IT when nobody else enterprise-wide has done the same.
 
Last edited:
Your HD must be fragmented to hell.
Any HD put out in the last few years shouldn't dip below 30 MB/sec

With almost 2 million files, I am not shocked that it took over 10 mins.
 
Your bottleneck here is not the hard drive, it's the 10 Mbps link that this E drive apparently is on. 😉

xfer.jpg


Transfer took about 2-3 minutes, coming down over a gigabit link from a RAID 5 array to a single drive. Faster if I go the other way. With gigabit the drives are the bottleneck, but way faster than what you show.
 
Last edited:
That's also a single large file, not 2 million small files. Lots of small files instead of one big file, the worst-case scenario for HDDs, is par for the course around here.

That was with USB 3.0, but as you can see it's a mechanical limit that wouldn't have even challenged USB 2.0.

Same drive synthetic benches around 100 MB/sec sustained sequential when I'm playing around with them on the work bench, so I know it's not a driver/USB3.0/link cable/etc issue.

But real world is NOT sustained sequential. In the real world, where nobody has what YOU want them to have, HDDs SUCK.

1.5 MB/sec... that's slower than the ISA bus.
 
Last edited:
I get at least 30 MB/sec on USB 2.0; there is something wrong with your drive/computer.

Not copying this folder you wouldn't. Mechanically impossible. Not without an SSD.

I'd rather move one large 20 GB archive file than do this.
 
Last edited:
If my math isn't mistaken, you have an average file size of <5k. Why on earth you should need over two million small files (assuming the already-copied files are similar) for any reasonable program is beyond my comprehension. As such, I would blame Java. Java is very good to blame when you have file count bloat.
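A quick sanity check on that figure; the totals below are assumptions standing in for the screenshot's numbers (roughly 9 GiB across roughly 1.9 million files), not exact values:

```python
# Back-of-envelope check of the average file size claimed above.
# total_bytes and file_count are assumed, not read from the screenshot.
total_bytes = 9 * 1024**3
file_count = 1_900_000

avg_kib = total_bytes / file_count / 1024
print(f"average file size: {avg_kib:.2f} KiB")  # just under 5 KiB
```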

Then again, I've also never understood why the FS/HDD drivers haven't been adjusted to read large streams of data and discard unneeded data. IE, with that many small files, why not read a swath of the drive with maybe 30 of them, and discard all but the file data you want in the end, then remove those files from the list of files that need to be copied? It would significantly speed up such reading operations, which can sometimes take longer than copying the entire drive would.

Also, have you considered using backup software (IoW, only copy modified files)? I don't see any good excuse for dealing with that kind of copy on a regular basis. By experience, I would say it should bring the operation down to <1hr (discovery, compare size and mtime), possibly <30min.
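The size-and-mtime comparison described above can be sketched in a few lines; `copy_if_changed` is a made-up helper name, and real backup tools also handle deletions, permissions, and open files:

```python
import os
import shutil

def copy_if_changed(src_root, dst_root):
    """Copy only files whose size or mtime differs at the destination.

    Minimal sketch of an incremental copy; returns the number of
    files actually copied.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            s = os.stat(src)
            try:
                d = os.stat(dst)
                unchanged = (d.st_size == s.st_size
                             and int(d.st_mtime) == int(s.st_mtime))
            except FileNotFoundError:
                unchanged = False
            if not unchanged:
                shutil.copy2(src, dst)  # copy2 preserves the mtime
                copied += 1
    return copied
```

Run twice against the same tree, the second pass stats every file but copies nothing, which is where the time savings come from.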

I get at least 30 MB/sec on USB 2.0; there is something wrong with your drive/computer.
Please. You get 30 MB/s on USB 2.0 copying very large, low-fragment-count files. USB 2 will make you beg for eSATA when you go copying small files, even to/from flash. USB 3 handles random access nearly as well as an internal drive, but not USB 2.
 
What exactly does this folder consist of? I understand multiple files takes longer than a single file, but it still should not take THAT long.
 
Not my computer, not my place to say what someone else uses their computer for or what kind of files they should or shouldn't have. You can't just say "oh don't develop in Java" when it's somebody's job.
 
It's easy to see mechanical drives slow to a crawl. Run any of your favorite HDD bench programs and check out the 4K performance. I'm really dying to know how anyone can argue that that speed is totally unheard of.



Sent from my MB860 using Tapatalk
 
WD Scorpio Black 2.5" 250GB from an HP 8540 notebook.

Again something I have no control over. If it were up to me, anything without an SSD would be thrown in the trash.
 
Not my computer, not my place to say what someone else uses their computer for or what kind of files they should or shouldn't have. You can't just say "oh don't develop in Java" when it's somebody's job.
No, but you can blame Java. Java is a great punching bag. Having that many files to copy is a great exhibit of some of what is wrong with Java (some of it is technical, but as much is cultural, and the cultural part is going to be nearly impossible to get away from, unless you can sneak in Scala).

It doesn't make the transfer any faster, but that's also a problem of the HDD controller drivers and the FS. It would be nice, FI, if the FS driver were intelligent enough to put those small files together on the disk, in a directory-wide extent (or, even multiple directories, depending on actual access patterns), and group files in subdirectories nearby, as well, to reduce seek times when reading groups of those files, instead of them being scattered all over the disk. Likewise, as the software-level IO queue filled up, reads from nearby files would be turned into grouped serial reads, keeping the file chunks you want only (instead of dispatching them and causing your IO to crawl).

Methods of improving such performance on spindle drives have been known for some time, but allocation choices tend to favor the common cases (sequential and random reads of larger files, non-random writes of small files), rather than worst cases. If you don't fit the common case, you should get an SSD--as such, it likely won't change. At least SSDs are getting down to reasonable prices.

For whatever reason, the big vendor SSD selection generally sucks, despite that business notebooks and workstations would benefit most from SSDs, and they tend not to need large local HDDs. You've got quite an uphill battle on that front, and personally, I do not understand it one bit.
 
well, at least you ain't loading from tape. That would take a bit longer.

What kind of tape?

Current tape systems like LTO store data linearly, not randomly, and are actually very, very fast, faster even than HDD max sequential. Well over 100 MB/sec uncompressed. Getting off tape would be very fast... getting it to the tape in the first place would still require reading off this HDD at 2 MB/sec...

C64 cassette tapes? Floppies? Yeah they were slow but in relative terms during their use we were only storing/retrieving a few kilobytes of data.

Now we have 100s of gigs of data but magnetic media transfer speeds haven't really improved much in 40 years. 20 MB/sec to 100 MB/sec in 40 years doesn't mean much when your data size has grown from 1 MB to 5 TB in the same 40 years.
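Putting that mismatch in numbers, using the 20 MB/sec and 100 MB/sec figures above:

```python
# Throughput grew ~5x while data sizes grew ~5,000,000x, so the time
# for a full copy went from a blink to half a day.
MB = 1024**2
old_s = 1 * MB / (20 * MB)              # 1 MB at 20 MB/sec
new_s = 5 * 1024**2 * MB / (100 * MB)   # 5 TB at 100 MB/sec
print(f"then: {old_s:.2f} s, now: {new_s / 3600:.1f} h")
```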

Poor software engineering practices don't help either. Software developers have gotten careless because disk SPACE is free, but don't think about the consequences to disk IO when you have to download, decompress, and install that 500 MB printer driver... space means nothing when you don't have the IOs to utilize it IMO.
 
Last edited:
DAT; we use them for our switches. A large office can take hours, and that isn't as large a copy as you're doing.

luckily, we only have to do that when a switch crashes, loads a new generic, or is having major issues and the only way to recover is from tape.

Gotta love Lucent/Alcatel or whatever they call themselves now.

of course the fastest processor in a 5e switch is 33 MHz if I remember correctly. Nothing like working with '80s/early-'90s tech.
 
Last edited:
Current tape systems like LTO store data linearly, not randomly, and are actually very, very fast, faster even than HDD max sequential. Well over 100 MB/sec uncompressed. Getting off tape would be very fast... getting it to the tape in the first place would still require reading off this HDD at 2 MB/sec...
However, if the HDD were to store the files in a manner similar to how they were accessed, or had a set of rules for directories filled with small files, you could get similar speed from a HDD, and an HDD would still have the ability to do random seeks, on top of that. In some ways, it was better back in the days of 'dumb' allocators, putting writes practically in sequence, even at the expense of high fragmentation for large often-edited files. It is a solvable problem, there's just not much interest, since it would add great complexity for the benefit of a small portion of users.

If SSDs were not an option on the horizon, it likely would have been dealt with, and I'm about 99% sure that NTFS could handle it in a backwards-compatible way (10 MB-or-thereabouts extents dedicated to small files, such that several levels of a directory tree could be in a single extent, and they would be optimally re-grouped during defrag passes; the extent would need to be read to read one file, but if there was enough RAM, that could allow all other files to be cached with it, allowing faster editing and copying; if plugged into an older version of Windows, metadata about the extent could be ignored or removed, and you wouldn't get the benefits when editing/copying those files in said older OS version). The man-hours required to develop, test, and maintain this sort of thing, however, would be enough that I could see people working with FSes considering it, and then deciding that it's too much work for too little gain.

What really sucks on your end, though, is that business users who would most benefit from high-performance SSDs (IE, those old Kingstons and the like with high WA and HDD-like 4KB random performance would not qualify, but you don't need the latest and greatest) are all too often either stuck with a vendor that makes them impossible options, or that offers them only in higher-end models than you might want (you just want an SSD for C:, not a Xeon, Quadro, and four computers' worth of cooling), and then they won't even tell you what you're getting. The best I've seen from big vendors is that Lenovo tells you it uses MLC flash... well, that's more than Dell tells you, but for your needs, you want a real make and model, dammit, and want it on a lesser computer. So, if you finally convince the guys in charge of the wonders of good SSDs, then what are the actual SSD options you'll have when they go to buy new computers? Even with an uphill battle, if you can specify one from Newegg, you'll be better off than many places.

Poor software engineering practices don't help either. Software developers have gotten careless because disk SPACE is free, but don't think about the consequences to disk IO when you have to download, decompress, and install that 500 MB printer driver... space means nothing when you don't have the IOs to utilize it IMO.
I see you use HP printers 😛. I think HP still makes some great small workhorses, but for drivers and software, come on over to Samsung and Brother.

Management of software is also quite often to blame. When you have to work with what already exists, you don't have the option to rip the guts out and make it smaller and better. You might also have tight deadlines, and not enough time to write all the code properly. You might have had poor communication about requirements, too, without time to make changes the right way. You may also be given coding and regulatory requirements which stupidly enforce (or your management considers them to do so) technical constraints that serve no real purpose but imaginary CYA on their part. On top of that, it's so often easier to convince non-technical people that gradual modification is superior to updating requirements and re-implementing. How much software development can improve in short timespans is quite often lost on those in charge, and projects of a certain size can't be stealthily rewritten through conspiracies between devs, admins, and users, hidden from management (been there, done that 🙂). On top of all that, you could be dealing with framework lovers, or people who think in some other language and write the language you're using as if it were that (never worked with anyone like that, but I have fixed horrible buggy bloated code made by such people).

For the pictured case, Java, like some other languages, forces many more files than really should exist, with many of them being only a handful of real lines of code, and half of those just naming wrappers of various kinds. While Java isn't alone in this file=module thing, Java is insane when it comes to how many files you end up needing for what other languages let you do in a few dozen lines of code in one file.
 
Last edited:
No, but you can blame Java. Java is a great punching bag. Having that many files to copy is a great exhibit of some of what is wrong with Java
You want to blame Java for creating one file per class, because that is disadvantageous when copying class files around on a highly fragmented filesystem? Oh well. The reasons they implemented it that way easily outweigh this basically non-existent problem. Why is it basically non-existent?

Because nobody in their right mind copies class files around. A developer will just get the source from the svn server or whatever and if you deploy a java project what do you want with unbundled class files? Jar files exist for a reason and one of them is to avoid copying thousands of small files around.
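A .jar is, in fact, just a zip archive with a manifest, which is why bundling is cheap. A sketch using Python's zipfile (the entry names are made up):

```python
import io
import zipfile

# Build a tiny "jar" in memory: one manifest plus one class entry.
# Bundling collapses thousands of tiny reads into one sequential read.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
    jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    jar.writestr("com/example/Foo.class", b"\xca\xfe\xba\xbe")

with zipfile.ZipFile(buf) as jar:
    print(jar.namelist())  # two entries, but a single file on disk
```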

But still, even IF we are copying class files around, there's no way that would be 9 GB of data, so there are some larger resources in there which should be much faster to copy around.

PS: And people are much too eager to throw away code and start from scratch because it'll be much nicer. Yep, rewriting your whole codebase so that you have... hopefully exactly the same product 2 years later (in practice you've swapped known bugs for hundreds of new, unknown ones) is a great way to make your customers happy; Netscape really showed us how well that works 😀
As a programmer it's a nice temptation, because who doesn't like to create a new architecture, incorporate all the stuff learned in the past, and hey, it's a whole lot more interesting than making incremental improvements. But we should never forget that it's one of the easiest ways to kill a product or even a company. (MS tried to rewrite Word from scratch and make it so much better; ever heard of that product? No, because they killed it after several years, before even releasing version 1. Borland also did it several times, although in their case they didn't continue working on their existing product, so after they finally got it released they found out that their competitors had used their time to add lots of new useful features and iron out bugs.)
Goodness, I got somewhat sidetracked here, but considering all those billions wasted in the industry by that stuff, it's easy to see why I deem it important. Though I'm all ears for examples of large codebases that were rewritten from scratch and turned out a complete success; I don't know of any.
 
Last edited:
2 million tiny files... It would probably be much faster imaging that folder with TrueImage and restoring it to wherever you wanted it.
 
http://i.imgur.com/bWeli.png

add a RAM drive 😀

Zip up the files on the disk, unzip into the RAM drive and manipulate them there, then zip them back up from the RAM disk onto the disk.
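That round trip can be sketched like this; temp directories stand in for the real HDD and a mounted RAM drive (e.g. R:\):

```python
import pathlib
import shutil
import tempfile

# Temp dirs stand in for the real locations: disk_dir for the HDD,
# work_dir for a mounted RAM drive.
disk_dir = tempfile.mkdtemp()
work_dir = tempfile.mkdtemp()

# Seed a "project" of small files and zip it onto the "HDD".
src = tempfile.mkdtemp()
pathlib.Path(src, "Foo.class").write_bytes(b"\xca\xfe\xba\xbe")
archive = shutil.make_archive(f"{disk_dir}/project", "zip", src)

shutil.unpack_archive(archive, work_dir)  # one sequential HDD read
# ... manipulate the small files at RAM speed under work_dir ...
shutil.make_archive(f"{disk_dir}/project", "zip", work_dir)  # one sequential HDD write
```

The HDD only ever sees two big sequential transfers; all the small-file churn happens in RAM.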

EDIT: lemme see how slow doing stuff (multiple small files) can be on a RAM drive... any ideas/test files?
 
Last edited:
Your HD must be fragmented to hell.
Any HD put out in the last few years shouldn't dip below 30 MB/sec

With almost 2 million files, I am not shocked that it took over 10 mins.

Guys, look at that file count!
it has nothing to do with fragmentation. He has nearly 2 million files, which means he is copying a directory with a lot of small files. Which means 2 MiB/s is actually pretty good for a mechanical HDD. (it can drop well below 1 MiB/s)
In fact, quick math shows that the average file size in that 8 gig directory is about 4.86 KiB, just slightly over the size of a single sector on a modern 4K HDD or an SSD. (the actual size of the data is probably much less, but a minimum of one sector must always be taken; storing 10 bytes of data on the HDD still occupies a whole 4 KiB sector on the disk)
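That per-sector minimum also puts a floor under the on-disk footprint (the file count here is assumed from the screenshot):

```python
# Minimum on-disk footprint of ~1.9 million files with 4 KiB allocation
# units: every file takes at least one sector, no matter how small.
file_count = 1_900_000
sector = 4 * 1024
floor_gib = file_count * sector / 1024**3
print(f"at least {floor_gib:.1f} GiB on disk")
```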

It is worth mentioning that this is NOT your standard workload, in part because programs were designed with HDDs in mind, using various techniques to mitigate their atrocious random speed.
For example, a video game will pack many thousands of assets (each its own file) into a large zip file (~500 MiB per file in your typical game). It will then read it sequentially at 100+ MiB/s into RAM, extract it in RAM, copy the assets to VRAM, and manipulate them there. This is one of the things games do during loading screens.

RAM is still much, much faster than SSDs, and there are ways around the limitations of HDDs in most workloads.
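The loading-screen pattern described above, sketched with an in-memory zip (asset names invented):

```python
import io
import zipfile

# Build step: pack many small assets into one archive file.
packed = io.BytesIO()
with zipfile.ZipFile(packed, "w") as pak:
    pak.writestr("textures/wall.dds", b"\x00" * 256)
    pak.writestr("models/crate.obj", b"v 0 0 0\n")

# Load time: one sequential read pulls the whole pack into RAM, then
# individual assets are extracted without touching the disk again.
blob = packed.getvalue()
with zipfile.ZipFile(io.BytesIO(blob)) as pak:
    crate = pak.read("models/crate.obj")
```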
 
Last edited:
You want to blame java for creating one file per class, because that is disadvantageous when copying class files around on a highly fragmented filesystem? Oh well. The reasons they implemented it that way easily outweigh this basically non-existent problem.
Insane numbers of files for the amount of added functionality, all too often creating a hard-to-manage mess, is not a non-existent problem (the OP's copying speed is just a nice side effect). The culture of adding libraries and frameworks every other day also helps make it worse, of course--Java alone isn't that bad, but Java applications tend to become that bad.

Ripping out an application's guts when it's a big, complex application that is fairly well-documented, including unfixed bugs (the kind that are easier to work around than fix, due to risk): generally bad. Ripping out an application's guts when it's buggy by design, can be replaced modularly, and there's a quality live testing environment: depends. I've had to deal with code that could have made it to TDWTF, were I inclined to copy it, anonymize it, and make up cheesy stories.
 
Last edited: