I would guess that the designers of MP3 chose a sampling interval of 1152 for its compression, so that they could truly say that the frequency response goes down to 20 Hertz. With an interval of 1024 samples, one could only get down to about 22 Hertz.
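As a rough sanity check on those numbers, the lowest representable frequency can be taken as one full cycle fitting within one sampling interval. The snippet below is my own illustrative arithmetic, assuming a 22.05 kHz sample rate (an assumption on my part, chosen because it roughly reproduces the 20 Hz and 22 Hz figures above):

```python
def lowest_frequency(sample_rate_hz, interval_samples):
    """Frequency whose full cycle just fits in one sampling interval.

    This is a simplified model, not the exact MP3 filter-bank behaviour.
    """
    return sample_rate_hz / interval_samples

# Assumed sample rate of 22050 Hz:
print(round(lowest_frequency(22050, 1152), 1))  # roughly 19.1 Hz
print(round(lowest_frequency(22050, 1024), 1))  # roughly 21.5 Hz
```

With an interval of 1152 samples the result lands just under 20 Hz, while 1024 samples would only reach down to about 21.5 Hz, in line with the estimate above.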
And the difference of 128 samples also divides evenly into 1024.
(Edit 05/23/2016 : ) According to the terminology of some other sources, what I refer to as ‘one sampling interval’ is named “the frame”, while what I have referred to as ‘one frame’ is referred to as “a granule”.
(Edit 05/31/2016 : ) Another reason seems to be the fact that both 1152 and 576 are divisible by 3. When a transient is detected, the need seems to exist always to replace an odd number of frames with an odd number of shorter frames. It seems that during playback, a count between even-numbered and odd-numbered granules takes place, which also causes an alternation between +1 and -1 , except for coefficient (0) , which must be encoded with its sign bit stored. A sequence of ( +1, -1, +1 ) will replace ( +1 ), and a sequence of ( -1, +1, -1 ) will replace ( -1 ) .
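The alternation described above can be sketched in a few lines. This is strictly a toy illustration of my reading of it, not actual decoder logic, and `granule_signs` is a name I have made up: each granule gets a sign by parity, and where a transient replaces one long granule with three shorter ones, the sign s becomes the odd-length sequence ( s, -s, s ):

```python
def granule_signs(n_granules, transient_at=None):
    """Illustrative sketch of the sign alternation described above.

    Even-numbered granules get +1, odd-numbered granules get -1.
    At a transient, one granule of sign s is replaced by three
    shorter blocks with signs (s, -s, s), preserving the alternation.
    """
    out = []
    for i in range(n_granules):
        s = +1 if i % 2 == 0 else -1
        if i == transient_at:
            out.append((s, -s, s))  # e.g. (+1, -1, +1) replaces (+1)
        else:
            out.append((s,))
    return out

print(granule_signs(4, transient_at=1))
# [(1,), (-1, 1, -1), (1,), (-1,)]
```

Because the replacement sequence both begins and ends with the original sign, the granules that follow the transient keep the same parity they would have had anyway, which is consistent with the count needing an odd number of shorter frames.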