Yes, RAM I/O (or the FSB) often becomes the bottleneck.
There are plenty of other possibilities though. Do the two threads use variables *next to* each other?
If you have an array a, and thread 0 writes to a[0], thread 1 writes to a[1] and so on, that will hurt performance, because the CPU caches don't operate on single bytes but on whole cache lines (typically 32 or 64 bytes per line; 64 bytes on most current x86 CPUs, which corresponds to 16 ints or floats, or 8 doubles).
So if this is the case, and the two threads access data a few bytes away from each other at the same time, the cores have to move that cache line from one core to the other, and back, and forth again, and back (since a line that is being written to can't be held in both cores' caches at the same time). This is commonly called false sharing.
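One common way around this is to give each thread's data its own cache line. The following is only a minimal sketch, assuming C++17 and a 64-byte line; `kCacheLine`, `PaddedCounter` and `count_in_parallel` are hypothetical names made up for illustration, not something from your program.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache line size. 64 bytes is common on current x86 CPUs,
// but check your own hardware.
constexpr std::size_t kCacheLine = 64;

// alignas pads and aligns the struct so that each counter
// occupies a cache line of its own.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Hypothetical helper: every thread increments only its own counter,
// so no two threads ever write to the same cache line.
inline long count_in_parallel(unsigned num_threads, long iters) {
    // C++17 is assumed here so that std::vector respects the over-alignment.
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i) {
                counters[t].value += 1;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    long total = 0;
    for (const auto& c : counters) {
        total += c.value;
    }
    return total;
}
```

If you swap `PaddedCounter` for a plain `long` and time both versions, the unpadded one will often be noticeably slower with two or more threads, though how much depends on the CPU and on what the compiler does with the loop.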
Also, you may want to use 3 or 4 threads, in order to ensure that there's always a thread ready to run, even if one gets blocked. You generally need slightly more threads than you have cores for best performance.
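If you don't want to hardcode the thread count, you can derive it from the number of cores. This is just a sketch: `suggested_thread_count` is a made-up helper, and the right number of extra threads is something you'd have to find by measuring.

```cpp
#include <thread>

// Hypothetical helper: number of hardware threads plus a few extra,
// so that a runnable thread is still available when one blocks.
inline unsigned suggested_thread_count(unsigned extra) {
    // hardware_concurrency() may return 0 when the value is unknown,
    // so fall back to 1 in that case.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) {
        cores = 1;
    }
    return cores + extra;
}
```

On a dual-core machine, `suggested_thread_count(1)` or `suggested_thread_count(2)` would land on the 3 or 4 threads mentioned above.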
Without knowing more about how your program works, it's impossible to say what's holding you back.
Another related, but simpler explanation might be that the single-threaded version just gets better cache locality. It doesn't get as many cache misses as the multithreaded version, for whatever reason. Again, impossible to say without knowing more about your program.
A third option might be that the greater bandwidth usage means your program is seeing relatively higher latencies (because there are more pending requests that have to be served before *your* request returns data), which causes the CPUs to stall and have nothing to do for some of the time. That might be possible to fix by rearranging your code a bit to reduce dependencies between instructions.
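To show what "reducing dependencies" can look like, here's a small sketch using a sum over an array. Both function names are made up for the example, and whether it helps in your case depends on your code and compiler.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the previous one,
// so a slow load stalls the whole chain.
inline double sum_single_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) {
        s += x;
    }
    return s;
}

// Four independent accumulators: the CPU can keep working on the
// other chains while one of them waits for its data.
inline double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        s0 += v[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add the values in a different order, so with floating point the results can differ slightly in the last digits.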
Btw, don't run Sandra with your program running. It's meant to profile your system *alone*. Anything you get while the CPU is busy with other processes is going to be highly skewed and inaccurate.
There's no way to determine how much RAM bandwidth is being used at any instant in time. To do that, you'd have to keep track of everything that happens over a few hundred nanoseconds, which would take so much CPU time that it would skew the results badly.