I think you have a few things mixed up.
CPUs can have varying data bus widths and still be the same "bitness." A 32-bit CPU could have an 8-bit bus feeding it, but that'd be kinda slow. The "bitness" of a CPU refers, at least as I understand it, to the number of bits (the number of places that can hold a 1 or a 0) it uses for integer values.
I think what you're thinking of is the size of the registers. A 32-bit CPU has 32-bit integer registers, meaning it can represent unsigned numbers all the way up to (2 ^ 32) - 1. After that, you get overflow: every bit, including the last available one (leftmost, if you're thinking of it going from right to left), is already a 1, so adding more would produce a carry, except there are no more bits left to store the spill-over.
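Here's a rough sketch of that overflow in Python. Python integers never overflow on their own, so the 32-bit register is faked with a mask. The names REGISTER_BITS, REGISTER_MASK and add_u32 are made up for illustration, and real hardware reports the carry through a flag rather than a return value.

```python
REGISTER_BITS = 32
REGISTER_MASK = (1 << REGISTER_BITS) - 1  # 0xFFFFFFFF, i.e. (2 ^ 32) - 1


def add_u32(a, b):
    """Add two unsigned values the way a 32-bit register would hold the result.

    Returns (result, carry_out). carry_out is True when the true sum
    needed a 33rd bit that the register has no room for.
    """
    total = a + b
    result = total & REGISTER_MASK   # keep only the low 32 bits
    carry_out = total > REGISTER_MASK
    return result, carry_out


largest = (2 ** 32) - 1
print(largest)              # 4294967295
print(add_u32(largest, 1))  # (0, True): every bit was 1, the carry is lost
print(add_u32(1, 2))        # (3, False)
```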
Now with a 256-bit register, you could have integers that go all the way the heck up to (2 ^ 256) - 1. That's pretty damned big. Will we one day make use of that? Maybe. Not any time soon. Few people actually need more than 32-bit processing (more people are running into the problem of only having 32-bit flat addressing, but that's another issue). A 256-bit register doesn't inherently mean the CPU could process four data elements (each 64 bits in size) at the same time. Doing so is called SWAR computing (SIMD Within A Register). This requires all four data elements to have the exact same instruction operated upon them (hence Single Instruction, Multiple Data). That happens in a lot of code, but not all of it. A lot of things need different instructions performed on them, which means not everything can be done in SIMD easily.
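To make the SWAR idea a bit more concrete, here's a minimal sketch in Python that treats one big integer as a 256-bit register holding four 64-bit lanes. The helper names (pack, unpack, swar_add) are hypothetical, and the masking trick is just one common way to keep a carry from leaking between lanes, not how any particular CPU does it.

```python
LANE_BITS = 64
LANES = 4
REG_BITS = LANE_BITS * LANES           # 256
REG_MASK = (1 << REG_BITS) - 1
LANE_MASK = (1 << LANE_BITS) - 1

# HIGH has only the top bit of each lane set; LOW has every other bit set.
HIGH = sum(1 << (i * LANE_BITS + LANE_BITS - 1) for i in range(LANES))
LOW = REG_MASK ^ HIGH


def pack(values):
    """Pack four 64-bit values into one 256-bit integer (lane 0 is lowest)."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & LANE_MASK) << (i * LANE_BITS)
    return reg


def unpack(reg):
    """Split a 256-bit integer back into its four 64-bit lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def swar_add(a, b):
    """Add all four lanes with one wide addition; each lane wraps on its own."""
    # Add the low 63 bits of every lane. The sum fits inside the lane,
    # so no carry can cross into the neighbouring lane.
    low_sum = (a & LOW) + (b & LOW)
    # Fold the top bit of each lane back in with XOR (addition without carry).
    return (low_sum ^ ((a ^ b) & HIGH)) & REG_MASK


a = pack([1, 2, 2 ** 64 - 1, 10])
b = pack([1, 2, 1, 5])
print(unpack(swar_add(a, b)))  # [2, 4, 0, 15]
```

Note that the third lane wraps around to 0 without disturbing the fourth, and that all four lanes got the same operation (an add), which is exactly the SIMD restriction.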
So a 256-bit functional unit could handle non-SIMD data, I would think, but would only be able to operate on one data element at a time. This would, of course, be a waste.
Four 64-bit functional units would be better in most respects, assuming the data element size is 64 bits. That way, the CPU could process four data elements at once, whether they all have the same operation performed on them (each would still need its own instruction, however) or have four different instructions performed on them.
Just to be clear, I went from registers to functional units (in our case, the functional unit would be an integer unit). This is because a 64-bit functional unit is the part that actually does the work, and it stores and retrieves its information from registers.
Having four functional units is somewhat, kinda-ish-but-not-really like having SMP on a chip. Four functional units would mean that it would be a 4-way superscalar core (scalar meaning only one instruction at a time, superscalar meaning more than one). The functional units can do their thing on data that is from one particular thread (currently), and nothing more. So a 4-way superscalar core can perform up to 4 instructions per cycle (assuming it is pipelined) from one thread. An 8-way superscalar core could perform up to 8 instructions per cycle (again, assuming it's pipelined) from one thread.
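As a back-of-the-envelope illustration of "n-way means up to n instructions per cycle," here's a toy Python model. It assumes every instruction comes from one thread, is independent of the others, and takes one cycle, which real code rarely manages. The function name issue_schedule and the instruction labels are made up for the example.

```python
def issue_schedule(instructions, issue_width):
    """Group instructions into cycles, at most issue_width per cycle.

    Toy model: assumes all instructions are independent, so the core
    can always fill every slot until it runs out of work.
    """
    return [
        instructions[i:i + issue_width]
        for i in range(0, len(instructions), issue_width)
    ]


work = ["i0", "i1", "i2", "i3", "i4", "i5", "i6", "i7"]

print(len(issue_schedule(work, 1)))  # 8 cycles on a scalar core
print(len(issue_schedule(work, 4)))  # 2 cycles on a 4-way superscalar core
print(len(issue_schedule(work, 8)))  # 1 cycle on an 8-way superscalar core
```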
In order to have the functional units work on more than one thread at once, you either have to break the die up into multiple CPUs (called CMP, or chip-level multiprocessing), or do something like SMT (simultaneous multithreading). CMP means that they are physically different CPUs (probably identical), each with its resources cut in half compared to a regular one (or cut to a fourth if it's 4 CPUs on a die, etc.). SMT is about sharing the resources of a chip (though some have to be duplicated). In an SMT machine, the CPU can either work on one thread, just like a regular superscalar, or work on as many threads as it has support for (an n-way SMT machine could work on n threads at the same time).
Um....if you have any other questions, or if I confused you, LMK
