
FreeBSD / *nix Question - How Many Files Per Directory Will Degrade Performance?

Superwormy

Golden Member
I'm going to have a lot of files in a directory, under FreeBSD 5.0. Currently the count is at 52,000 per directory.

That's going to grow, by a lot.

So my question is, at what point should I STOP putting files in that directory and start putting them in another? Basically the only thing that happens to those files is PHP will read them in, one at a time, by filename. Is there any good rule to go by for when I should start putting them in a second directory so as not to degrade the performance of finding a file?
 
Well, the good rule has always been when it takes you too long to easily find files, or when the "ls" command takes more than 10 seconds to print out the files in the directory. You still want to be able to work with the files "off-line" through the command line, even if you don't intend to use them that way. If your Apache/PHP install dies for some reason, you're still gonna have to back up those files or work with them through the shell.
 
Depends on the filesystem. I know FFS-derived filesystems like ext3 are pretty bad when it comes to huge directories, so I would assume whatever FreeBSD uses is the same unless they've added hashed directories recently.

Sorry, no FreeBSD system to test on.
XFS:
Creating 10,000 files:
$ time for i in `seq -w 1 10000`; do touch $i; done

real 0m39.822s
user 0m11.740s
sys 0m25.005s

$ time for i in `seq -w 1 10000`; do stat $i > /dev/null; done

real 1m3.829s
user 0m27.835s
sys 0m30.965s

$ time find -type f -exec rm -f \{\} \;

real 0m27.276s
user 0m9.075s
sys 0m16.640s

ext3:
$ time for i in `seq -w 1 10000`; do touch $i; done

real 1m14.453s
user 0m11.810s
sys 0m56.065s

$ time for i in `seq -w 1 10000`; do stat $i > /dev/null; done

real 1m5.654s
user 0m27.495s
sys 0m30.995s

$ time find -type f -exec rm -f \{\} \;

real 0m22.972s
user 0m8.640s
sys 0m13.350s
 
Nothinman, I think the bulk of the time in your tests is forking: I ran top, and bash was taking 30% CPU. I got these results (the latter ones eliminate all the per-file forking in yours):

% time for i in `seq -w 1 10000`; do touch $i; done

real 0m30.002s
user 0m8.110s
sys 0m23.240s

% time stat * >/dev/null

real 0m1.692s
user 0m1.240s
sys 0m0.440s

% time rm *

real 0m0.193s
user 0m0.040s
sys 0m0.160s


Not sure if I'm just overlooking something though.
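To put a rough number on the forking theory, here's a small sketch that times N stat() calls made in-process against N runs of an external `stat` binary, one process per file (this assumes a coreutils-style `stat` command on the PATH, and uses a throwaway temp directory):

```python
import os
import subprocess
import tempfile
import time

N = 100
d = tempfile.mkdtemp()
paths = []
for i in range(N):
    p = os.path.join(d, "%05d" % i)
    open(p, "w").close()
    paths.append(p)

# stat every file without ever leaving the process
t0 = time.perf_counter()
for p in paths:
    os.stat(p)
in_proc = time.perf_counter() - t0

# same work, but one fork+exec of an external stat per file
t0 = time.perf_counter()
for p in paths:
    subprocess.run(["stat", p], stdout=subprocess.DEVNULL)
per_proc = time.perf_counter() - t0

print("in-process: %.4fs, one process per file: %.4fs" % (in_proc, per_proc))
```

Even at only 100 files, the per-process version should be slower by a couple orders of magnitude, which matches the difference between the shell loop and `stat *` above.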
 
rm * won't work after a certain limit, though; there's a "number of arguments" limit that I'm not sure is specific to bash or not. Try it with 52,000 files and see if it works.
 
Ah yes, that is true. Actually, the strange thing is that mpg321 /music/*mp3 didn't work for me on Debian; bash complained that the argument list was too long. However, what I pasted above worked, so I guess before it was a string length limitation, not a limit on the number of arguments (2,000 mp3s or so). I'll play with it for a bit to see if I can run into the limitation (gonna take a couple minutes each time, so I'll post back later 😛)
 
Hm, yep, tried 52,000 and it was too long. Still, I think forking for each file causes a massive slowdown. I'd do something like rm_orwhatevercommand `ls -1 | head -n 10000`, then for the next 10k, rm `ls -1 | head -n 20000 | tail -n 10000`, then head 30000 / tail 10000, etc.
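The same batching idea can be sketched without the head/tail pipelines: chunk the directory listing and hand each chunk to a single rm. The chunk size of 1,000 here is arbitrary, just small enough to stay comfortably under the argument-list limit, and `rm` is assumed to be on the PATH:

```python
import os
import subprocess
import tempfile

# build a scratch directory with a few thousand files to delete
d = tempfile.mkdtemp()
for i in range(2500):
    open(os.path.join(d, "f%05d" % i), "w").close()

names = sorted(os.listdir(d))
CHUNK = 1000  # arbitrary; keeps each rm invocation well under the limit
for start in range(0, len(names), CHUNK):
    batch = names[start:start + CHUNK]
    # one rm per 1,000 files instead of one per file, or one for all 52k
    subprocess.run(["rm", "--"] + batch, cwd=d, check=True)

print(len(os.listdir(d)))  # 0
```

Three forks for 2,500 files instead of 2,500, and no "argument list too long" no matter how big the directory grows.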
 
Whip up a python program to do it then, or do you think the overhead of python would skew the results too?

I was going to do a small C program, but I'm going to sleep instead. Maybe tomorrow at work, if I get bored.
 
Python definitely didn't slow it down; I tried it two ways:

1. make lists of 5000 files each, then shell out and rm $files for each list
2. unlink() in python for each file

1 took something like 900ms, 2 took something like 300ms. 😀

#!/usr/bin/python
import os, sys
os.chdir(sys.argv[1])  # listdir returns bare names, so work inside the target dir
map(os.unlink, os.listdir('.'))


Most stuff in Python is actually a lot faster than people commonly think. All of the modules are compiled C; it's just a matter of how you write your loops and various other things. For example, map() is faster than a for loop, since map() is just a for loop turned into a function and therefore executed by compiled C code. Either way you look at it, I'm sure the vast majority of the execution time in this case is spent waiting for the disk.
 
Another thing to consider is inode count. While trying to do this test on an ext3 partition I ran out of inodes. I believe FFS uses a static inode table generated at newfs time, so it would be very easy to run out of them with this many files. Filesystems like XFS, JFS, etc. use dynamically allocated inodes to avoid things like this.

I just ran it on XFS and it took ~15s for 52,001 files (I had the python script in that directory, heh), but XFS is 'slow' for unlinks compared to filesystems like ext3; maybe a full 'stat' would be better.
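On the inode point, you don't have to wait until a test fails to notice you're low: `os.statvfs` exposes the filesystem's total and free inode counts (`f_files` and `f_ffree`) for whatever path you give it, the same numbers `df -i` reports:

```python
import os

# inode headroom for the filesystem that "/" lives on
st = os.statvfs("/")
total = st.f_files   # total inodes (fixed at newfs/mkfs time on FFS/ext3)
free = st.f_ffree    # inodes still available
used = total - free
print("inodes: %d used of %d (%.1f%% free)" % (used, total, 100.0 * free / max(total, 1)))
```

Checking this before dumping another 50k files into a directory would have saved the failed ext3 run above.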
 
stat'ing took 861ms for the same 15,000 files (this is ext3, btw)

edit: actually stat'ing took 200-something; I decided to have it print the output of the stat, and that moved it up to 861. To just stat, edit the script to do os.stat instead of os.unlink.
 
Why only 15K files? He's currently at 52K and growing.

XFS 15K files gave: real 0m0.264s
XFS 52K files gave: real 0m0.821s
XFS 100K files gave: real 0m1.763s
 
I only did 15k because that itself took some 20 or 30 seconds to touch all of the files; I'm not patient enough for 50k or 100k 😛

Execution time seems to be a pretty flat curve too: roughly 0.018 ms per file for the 15k, 0.016 for the 52k, and 0.018 again for the 100k.
 
Which is why I'm curious to see how ext3 scales up; I don't have an ext3 filesystem with enough inodes to even do 30K files.
 
BTW as far as the original question, depending on the possible max number of files, I would start off with lots of subdirectories if there's no big disadvantage to doing so. Perhaps 27 directories for a-z and other, or if they're numerically named, have directories 0-9, or 0-99, or whatever. Be creative 🙂
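One way to sketch that: instead of keying off the first character (which skews badly if most names start the same way), derive the bucket directory from a stable hash of the filename, so files spread evenly however they're named. The 256-bucket fan-out below is an arbitrary choice for illustration, not anything FreeBSD- or PHP-specific:

```python
import hashlib
import os

BUCKETS = 256  # arbitrary fan-out; 52k files works out to ~200 per directory

def shard_path(root, filename):
    # a stable hash of the name picks the bucket, so a lookup needs no index:
    # recompute the hash and you know exactly which subdirectory to open
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = "%02x" % (int(digest[:8], 16) % BUCKETS)
    return os.path.join(root, bucket, filename)

print(shard_path("/data", "article-1234.txt"))
```

Since the original poster's PHP only ever opens files by name, PHP can recompute the same md5-based bucket on its side before opening the file, and no directory ever has to be scanned.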
 
Originally posted by: Nothinman
Which is why I'm curious to see how ext3 scales up; I don't have an ext3 filesystem with enough inodes to even do 30K files.

Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/hda7 6553600 41479 6512121 1% /home

Hmmm... 😀

going for a million right now, gonna be a while 🙂

edit: oh, I just realized that my per-file times were based on your XFS numbers... 😱

I think I'll go get some food while this is going 😛

I'm doing 2,000,000 in one dir, 1,000,000 in another dir, and 100,000 in another. My machine is absolutely decimated right now... gotta look into SCSI 😀
 
Well I'm going to bed, I'll check back later in the morning at work, I hope you time(1)'d the creation, I'd be curious to see how long that took too =)
 
Wow. Well, that was just a lot of answers over my head, I think... but thanks guys!

I guess the best solution would be to have lots of directories, as it's not hard for me to split them up anyway. Thanks!
 
Glad we were of some help 🙂

BTW creating the 100000 took something like a few hours, the 1000000 was taking way too long, I just said screw it and stopped.
 
BTW creating the 100000 took something like a few hours, the 1000000 was taking way too long, I just said screw it and stopped.

Did you use a for loop with touch, or Python? Cause like you said, all those fork()s are the problem there. Creating 100K files here took like 8 minutes with the simple bash for loop on a 1.2GHz Athlon.
 
Yeah, I just used the bash statement. I'm doing some experimenting right now; it seems that ext3 (or just general directory hashing? I don't know enough to say) slows down right around 5-6,000 files, and *really* slows down over 30-40,000 files. I'll post back with some interesting stuff 🙂
 