Sorry about the multiple posts. I'm trying to get up to 10,000 in the next week
Memory is counted in bytes, so two modules simply give you twice the bytes: 1G is still counted as 1G whether it's one 1G module or two 512M modules.
Current memory modules are 64 bits wide, which is 8 bytes. Dual channel, using two modules, makes memory accessible 128 bits (16 bytes) at a time. Whether the two modules can deliver all 128 bits to the CPU simultaneously, or serially as one 64-bit group after the other, depends on the memory controller and CPU. Since the two channels are independent, you could stagger accesses by half a cycle and in effect double the bandwidth even while multiplexing 64 bits twice.

In the case of Athlons, I believe Nvidia's controllers just buffer the double reads; the Athlon can't take it all in twice as fast, so Nvidia uses a "smart" controller to predict what the next memory read request will likely be. That gets the Athlon's requests filled modestly faster. If I recall correctly, P4s past a certain core have 128-bit-wide cache lines (twice the width of Athlons, earlier P4s, and all Celerons), so dual channel could fill cache lines at truly double speed, although I don't know whether the P4 has a 128-bit external data bus (instead of 64) to match.

Now, that doesn't mean the P4 can use all that bandwidth at today's CPU speeds, but cache misses should be less deadly. Modern CPUs depend on near-100% cache hits to run anywhere near their optimum (100% hits are impossible, of course). Remember that CPUs run internally at 10x or more their external rate, so waiting on memory kills the effective speed. That's why they have a big cache to begin with. If you deal with tremendous amounts of data that is only accessed once, such as when encoding video, a cache loses most of its effectiveness for data, and you're back to being dependent on raw memory bandwidth.
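If you want to see the arithmetic behind all that, here's a quick back-of-the-envelope sketch. The DDR400 numbers are just assumed for illustration (they weren't in my post above); the point is the width-times-rate math, not the exact figures:

```python
# Back-of-the-envelope dual-channel bandwidth arithmetic.
# Assumed figures (DDR400-era) are illustrative only, not measurements.

CHANNEL_WIDTH_BITS = 64          # one module delivers 64 bits per transfer
TRANSFERS_PER_SEC = 400_000_000  # DDR400: 200 MHz clock, 2 transfers/clock

def bandwidth_gb_s(channels: int) -> float:
    """Peak theoretical bandwidth in GB/s for N independent 64-bit channels."""
    bytes_per_transfer = CHANNEL_WIDTH_BITS // 8   # 64 bits = 8 bytes
    return channels * bytes_per_transfer * TRANSFERS_PER_SEC / 1e9

print(bandwidth_gb_s(1))   # single channel: 3.2 GB/s peak
print(bandwidth_gb_s(2))   # dual channel:   6.4 GB/s peak

# Filling a 128-bit (16-byte) chunk takes two beats over one 64-bit
# channel, but only one beat if both channels deliver in lockstep:
beats_single = 16 // (CHANNEL_WIDTH_BITS // 8)        # 2 beats
beats_dual = 16 // (2 * CHANNEL_WIDTH_BITS // 8)      # 1 beat
print(beats_single, beats_dual)
```

That last bit is why the 128-bit cache-line question matters: the "truly double speed" case only happens when the consumer on the other end can actually swallow 128 bits per beat.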
I could babble on longer, but I hope that does it.