That is because the sensor can use larger photodiodes, which gather more light and so increase light sensitivity.
However, as technology advances and smaller, more sensitive photodiodes can be manufactured, pixel densities can increase without an adverse effect on noise levels.
Yes and no. It depends entirely on how you define noise, and how you wish to present the final image. One view is to compare per-pixel noise; the other is to compare noise in the final printed image (or at a fixed display resolution, for that matter).
There are two main noise sources in the images captured by current imaging sensors: read noise and photon shot noise. (There are actually more noise sources, but this approximation is good enough for the current discussion.)
Photon shot noise is a physical phenomenon (the random arrival of photons per unit area), and we cannot do anything about it. A larger photosite will collect more photons, and because shot noise grows only as the square root of the photon count, it will be lower relative to the signal level. Conclusion: larger photosites = lower noise.
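To put a number on that, here is a minimal sketch that assumes photon arrivals follow Poisson statistics, so the noise is the square root of the mean photon count. The function name `shot_noise_snr` and the photon counts are mine, chosen only for illustration.

```python
import math

def shot_noise_snr(mean_photons: float) -> float:
    """Signal-to-noise ratio of a purely shot-noise-limited photosite.

    Assumes Poisson statistics: noise = sqrt(mean), so SNR = sqrt(mean).
    """
    return mean_photons / math.sqrt(mean_photons)

# Illustrative photon counts only; each step is a photosite with 4x the area.
for photons in (100, 400, 1600):
    print(photons, "photons -> SNR", round(shot_noise_snr(photons), 1))
```

Under that assumption, quadrupling the light collected doubles the signal-to-noise ratio.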
The problem with that argument is that, taken to its absurd conclusion, we would build a sensor with only one pixel, since that would minimise photon shot noise.
But there is a trick: photon shot noise is a function of the total light-sensitive area, not the pixel size. In other words, if I take one large pixel and split it into four equal sub-pixels, then I can simply add the values recorded by the four smaller pixels to obtain the same value as the original larger pixel. In short: downsizing a higher-resolution image has the same (beneficial) effect of reducing photon shot noise as we would obtain by using larger pixels.
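The four-sub-pixel argument can be checked with a small simulation. This is only a sketch under idealised assumptions: pure Poisson shot noise, no read noise, no dead area, and a made-up mean of 100 photons on the large pixel. The helper names (`poisson`, `snr`) are mine.

```python
import math
import random
import statistics

def poisson(mean, rng):
    """Draw one Poisson-distributed photon count (Knuth's method, fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def snr(samples):
    """Empirical signal-to-noise ratio: mean divided by standard deviation."""
    return statistics.mean(samples) / statistics.stdev(samples)

rng = random.Random(0)   # fixed seed so the run is repeatable
TRIALS = 5000
MEAN_LARGE = 100.0       # assumed photons per exposure on the large pixel

# One large pixel versus four sub-pixels, each with a quarter of the area.
large = [poisson(MEAN_LARGE, rng) for _ in range(TRIALS)]
binned = [sum(poisson(MEAN_LARGE / 4, rng) for _ in range(4)) for _ in range(TRIALS)]
single_small = [poisson(MEAN_LARGE / 4, rng) for _ in range(TRIALS)]

snr_large = snr(large)
snr_binned = snr(binned)
snr_small = snr(single_small)

print("large pixel      :", round(snr_large, 2))
print("4 small, summed  :", round(snr_binned, 2))
print("1 small, per-pixel:", round(snr_small, 2))
```

The large pixel and the sum of four small pixels should both land near an SNR of 10, while a single small pixel sits near 5. That is the per-pixel versus final-image distinction in miniature.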
Of course, this is a crude approximation, and we really have to take into account the amount of "dead area" on the sensor, which will typically increase as you decrease the size of individual pixels. This again implies that larger pixels would be better in practice.
The other type of noise, read noise, can be reduced by using better technology, such as the on-sensor amplifiers in the recent Nikon and Sony sensors. If you can pull the read noise down low enough, then photon shot noise becomes the dominant noise source. This implies that the "downsample a higher-resolution image" approach to reducing photon shot noise becomes very effective. This is why the D800 is at the top of the DxO charts (they use downsampled 8 MP images for their tests). It also explains why the Nokia 808 PureView produces such great-looking 8 MP images.
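A rough way to see why low read noise matters so much for downsampling: every pixel that is read out contributes its own read noise, so summing four small pixels pays that penalty four times. The sketch below assumes independent noise sources that add in quadrature and works in electrons. The function name `binned_snr` and the values (100 electrons of signal, 10 versus 1 electron of read noise) are illustrative, not measurements of any real sensor.

```python
import math

def binned_snr(total_electrons: float, n_pixels: int, read_noise_e: float) -> float:
    """SNR after summing n_pixels that together collect total_electrons.

    Shot-noise variance equals the total signal (Poisson assumption);
    each pixel read adds read_noise_e**2 of independent variance.
    """
    variance = total_electrons + n_pixels * read_noise_e ** 2
    return total_electrons / math.sqrt(variance)

SIGNAL = 100.0  # assumed electrons collected over the whole area (dim scene)

for read_noise in (10.0, 1.0):  # hypothetical "old" vs "modern" read noise
    one_large = binned_snr(SIGNAL, 1, read_noise)
    four_small = binned_snr(SIGNAL, 4, read_noise)
    print(f"read noise {read_noise:>4} e-: "
          f"1 large = {one_large:.2f}, 4 small summed = {four_small:.2f}")
```

With 10 electrons of read noise the single large pixel wins clearly (about 7.07 against 4.47), but at 1 electron the two are nearly identical (about 9.95 against 9.81), which is the situation where downsampling a higher-resolution image works well.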
So here is how I see the noise issue: you have a target size you wish to display your final image at. You also have an "acceptable amount of noise" in mind. If your target display size is small (say, a 12x8 print), then you could use either a 12 MP sensor or a 36 MP sensor; the downsampled 36 MP image can definitely match the 12 MP sensor in terms of noise performance (think higher ISO, not necessarily extremely high ISO). At some point (very high ISO) the 12 MP sensor may end up winning, depending on the fill factors of the sensors we are comparing.
The benefit of the 36 MP sensor is that it will be able to reach that "acceptable amount of noise" at full resolution under the right conditions (good light, low ISO). The 12 MP sensor will do just as well noise-wise under the same conditions (probably slightly better), but you only have 12 MP's worth of pixels, so less detail.
Based on this view, we should not fight the trend towards increasing sensor resolution, since on average we will be better off with more pixels (we, and the memory card manufacturers, of course).