Compared with Equation (7), we should set γm to about N times γs to ensure that each mini-batch contributes enough to the model update. An important role of momentum in SGD is to attenuate the influence of the noisy component of the gradients by using historical update information. With per-split optimization, the lack of update information from previous blocks weakens this attenuation effect, while a larger γm amplifies the influence of noise. Consequently, the model update resulting from a single block is quite noisy, and the performance of MA degrades as the number of parallel workers increases.

Assume that a constant BLR ζ and a constant BM η are used in BMUF. Since η builds links between successive blocks, Bi’s contribution to ∆b is