Originally posted by: drag
Originally posted by: kylef
This isn't something that couldn't have been added a lot earlier, but with 2.6.x the Linux guys are aiming for the desktop. The 2.4.x series pretty much wrapped up Linux as a successful, enterprise-capable, scalable server OS, so it was time to focus elsewhere. The server side has taken on a life of its own, and large third parties (such as IBM) are now pouring code, time, and money into making it work even better than it did in the past.
Actually, one of the biggest changes in the 2.6 series kernel (IIRC) is the O(1) scheduler, and that is aimed directly at servers. Linux 2.4.x and earlier had an O(n) scheduler, which scaled poorly with the number of processes/threads on a system. Basically, the 2.4 approach scanned the entire ready list at the end of each time slice to see if anything should be boosted in priority or rescheduled. 2.6 instead boosts only those processes that meet certain criteria, eliminating the need to scan the entire list on each scheduler run.
Large enterprise-class servers with hundreds of execution units were seeing very high system scheduling overhead, and this was the culprit.
It's also a direct attempt to catch up with Solaris and NT, which have had O(1) schedulers since 1992.
I know that Windows has had a preemptive scheduler for ever and a day, but I don't know whether it's O(1) or not.
Preemptiveness, when it comes to dealing with end users, only affects what priority (or goodness) a user's actions get on the system; it has nothing to do with whether the scheduler is O(1) or not.
(edit: At least when it comes to user-initiated actions. Linux has always been multitasking, and processes have a niceness (priority) range of -20 to 19 that you can assign. The kernel allocates time (quantum slices) round-robin style, but still gives priority to the less nice processes. I don't know if the 2.4 kernel did ANY preempting; I assume it did. Probably wrong. I don't know. But with 2.6 the "preemptive" option specifically refers to letting the kernel itself be interrupted mid-task, which mostly helps responsiveness for stuff initiated by user input. I believe.)
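For reference, here's a minimal sketch of how the kernel maps those nice values onto its internal priority scale. The macro name mirrors the kernel's actual NICE_TO_PRIO, but treat the details here as illustrative, not as the real source:

```c
/* Sketch: the kernel's internal priority scale has 140 levels.
 * Levels 0-99 are reserved for real-time tasks, and the 40 nice
 * levels (-20..19) land on 100-139. Based on the kernel's
 * NICE_TO_PRIO macro; simplified for illustration. */
#define MAX_RT_PRIO 100

static int nice_to_prio(int nice)
{
    return MAX_RT_PRIO + nice + 20;  /* -20 -> 100, 0 -> 120, 19 -> 139 */
}
```

So a default nice-0 process sits at internal priority 120, while a maximally nice (+19) process sits at 139, the least favored level.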
Now when it comes to enterprise stuff, there was a different issue that Linux faced.
What scaled poorly in the Linux 2.4 setup wasn't so much the number of processes as the number of processors.
The 2.4 kernel implemented one runqueue, one tasklist, regardless of the machine it was running on.
This worked fine up to a point. It scaled just fine with one CPU. 2 CPUs? Still in the game. 4? OK. 8? That's when you start to see issues.
If you look at benchmark graphs (back when people were still doing big benchmarks of w2k vs. Linux), they'd both scale fine up to 4 processors; Linux would start to drop off at 8, and after that it wouldn't scale at all.
Linux's problem was that it had only one tasklist, which doubled as its global runqueue.
You see, with the single runqueue/tasklist, the scheduler just ran up and down the list looking for the process with the highest goodness and then assigned it to a CPU. That was it. It would read the runqueue, find the next process, and hand it to a CPU to run. Then it would do it again for the next one, and the next one, and so forth.
So with 4 CPUs it wasn't a problem, and it mostly worked with 8. Beyond that it caused issues: there was no way to make sure each CPU kept getting the same threads, so the large caches in the CPUs were mostly wasted. Plus, since the scheduler itself was single-threaded, it had a single spinlock and a single runqueue. It couldn't read all the runnable threads and feed them to the CPUs fast enough to utilize them all efficiently.
So basically the overhead piled up, and anything above 8 CPUs was getting wasted. That's what made it effectively O(n) on machines with more than 4 CPUs; really it wasn't so much O(n) as a brick wall above 8.
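That single-runqueue scan looks roughly like this. This is a toy model, not the actual 2.4 code; the real goodness() also weighs things like whether the task shares the current address space and which CPU it last ran on:

```c
#include <stddef.h>

/* Toy model of the 2.4-style global runqueue: every scheduling
 * decision walks the whole list computing a "goodness" score, so
 * the cost grows linearly with the number of runnable tasks.
 * Names and the goodness formula are simplified illustrations. */
struct task {
    int counter;        /* remaining ticks in this task's time slice */
    int nice;           /* user-assigned niceness */
    struct task *next;  /* next task on the global list */
};

/* Simplified goodness: more remaining slice and lower nice win;
 * a task with no slice left scores 0. */
static int goodness(const struct task *t)
{
    return t->counter ? t->counter + (20 - t->nice) : 0;
}

/* O(n): scan every runnable task to find the best candidate. */
static struct task *pick_next(struct task *head)
{
    struct task *best = NULL;
    int best_score = -1;
    for (struct task *t = head; t; t = t->next) {
        int score = goodness(t);
        if (score > best_score) {
            best_score = score;
            best = t;
        }
    }
    return best;
}
```

Every call to pick_next() touches every runnable task, which is exactly the cost that piles up once you have hundreds of runnable threads fighting over one list.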
It's not as bad as it seems at first for the 2.4 kernel. When the 2.4 setup was designed, robustness, simplicity, and speed on 1-2 CPUs were the priorities, because that's where Linux was mostly being used. It really wasn't being used for large databases and the like back then the way it is today.
What the 2.6 O(1) scheduler brought to Linux was an independent runqueue for each CPU. Each CPU has 140 possible priority levels, and the scheduler now allocates processes to each CPU much more efficiently. It doesn't have to rush between the CPUs; it can balance processes/threads at its leisure. To make sure no CPU remains idle, it rebalances the load across CPUs every 200ms, and if a CPU is sitting idle with nothing to run, it searches for suitable threads to migrate over from the other CPUs every 1ms.
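The rebalancing idea can be sketched roughly like this. The names and the migration condition here are simplified assumptions, not the kernel's actual load_balance() logic:

```c
/* Toy sketch of per-CPU runqueues with periodic rebalancing:
 * a CPU looks for the busiest other CPU and pulls a task over
 * when the imbalance is big enough. In the real 2.6 kernel this
 * runs roughly every 1ms on an idle CPU and every 200ms otherwise;
 * all names here are illustrative. */
#define NCPUS 4

struct cpu_rq {
    int nr_running;  /* tasks currently queued on this CPU */
};

/* Return the index of the CPU with the most runnable tasks,
 * or -1 if every runqueue is empty. */
static int find_busiest(const struct cpu_rq rq[], int ncpus)
{
    int busiest = -1, most = 0;
    for (int i = 0; i < ncpus; i++) {
        if (rq[i].nr_running > most) {
            most = rq[i].nr_running;
            busiest = i;
        }
    }
    return busiest;
}

/* Pull one task onto this_cpu if another CPU is clearly busier. */
static void load_balance(struct cpu_rq rq[], int ncpus, int this_cpu)
{
    int busiest = find_busiest(rq, ncpus);
    if (busiest >= 0 && busiest != this_cpu &&
        rq[busiest].nr_running > rq[this_cpu].nr_running + 1) {
        rq[busiest].nr_running--;   /* migrate one task over */
        rq[this_cpu].nr_running++;
    }
}
```

The point of only migrating on a clear imbalance is to keep threads on the CPU whose cache already holds their data, instead of bouncing them around like the single global runqueue did.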
So that's where the O(1)-ness comes from: no matter how many CPUs or threads there are, it takes the scheduler the same amount of time/overhead to pick and dispatch the next thread.
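The constant-time pick itself comes from pairing those 140 priority queues with a bitmap of which levels are non-empty, so finding the best runnable task is a find-first-set-bit plus a dequeue. A sketch with illustrative names; the real kernel uses an optimized sched_find_first_bit():

```c
#include <strings.h>  /* ffs(): find first set bit (POSIX) */

/* Toy model of a 2.6-style per-CPU runqueue: one queue per
 * priority level plus a bitmap of non-empty levels. Picking the
 * next task no longer depends on how many tasks are runnable. */
#define NUM_PRIO     140
#define WORD_BITS    32
#define BITMAP_WORDS ((NUM_PRIO + WORD_BITS - 1) / WORD_BITS)

struct runqueue {
    unsigned int bitmap[BITMAP_WORDS]; /* bit set => level non-empty */
    int count[NUM_PRIO];               /* tasks queued per level */
};

static void enqueue(struct runqueue *rq, int prio)
{
    rq->count[prio]++;
    rq->bitmap[prio / WORD_BITS] |= 1u << (prio % WORD_BITS);
}

/* O(1): find the lowest-numbered (best) non-empty priority level.
 * The loop bound is a constant (5 words for 140 levels), so the
 * cost is fixed regardless of the number of runnable tasks. */
static int pick_next_prio(const struct runqueue *rq)
{
    for (int w = 0; w < BITMAP_WORDS; w++)
        if (rq->bitmap[w])
            return w * WORD_BITS + ffs(rq->bitmap[w]) - 1;
    return -1; /* runqueue empty */
}
```

Compare this with the 2.4-style scan: here the work per scheduling decision is bounded by the fixed number of priority levels, not by the length of the task list.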
To an end user on a single- to quad-CPU machine, this sort of thing is fairly useless compared to the old single-tasklist way of 2.4, but the nice part is that 2.6's scheduler is still very quick, so there isn't any real penalty for using the slightly more complicated model on a computer that won't really benefit from it much.
Of course, some of these things were backported to 2.4 by companies like Red Hat as far back as 2002, to make sure their OS stayed competitive in enterprise work. Now that it's all official, though, it's much nicer.
There is still a lot to be done, though, like NUMA-based machines or the other specialized platforms used by high-end Unix servers and supercomputers. Stuff like that is where Sun's Solaris is still better than Linux. The scheduler still doesn't take into account the position of memory, or the size of a thread's/process's memory footprint, when deciding whether or not to migrate threads from one CPU to another.
Of course, this is where people like IBM come in; they have the hardware to test on and the know-how to deal with these sorts of challenges.
Now, how all that compares to Windows' setup, I haven't the foggiest clue.