ilarly with different configurations, yet outperforms SGD. For our BMUF approach, NBM learns faster yet converges to better solutions than CBM. Note that the NBM experiments with 8-32 GPUs converge to almost the same FER. In terms of testing-set performance, Table 3 shows that, in comparison with single-GPU SGD training, MA incurs WER degradations, while the BMUF approaches achieve about 5.0% and 5.3% relative WER reductions on Eval2000 and RT03S, respectively. Again, NBM performs better than CBM. We also compare the elapsed time per sweep of data in Table 4. Obviously, a linear speedup is also achieved on this task.

4. CONCLUSION AND DISCUSSION

From the above results, we conclude that the proposed BMUF approach can indeed scale out deep learning on a GPU cluster with almost linear speedup and with improved recognition accuracy, or no degradation, compared with mini-batch SGD on a single GPU. In addition to the verified cases of DBLSTM and DNN training on LVCSR tasks, we have also verified its effectiveness with up to 16 GPUs for CTC training of DBLSTMs on a handwriting OCR task using about one million training text-line images. Our ongoing and future work includes 1) scaling out to more GPUs; 2) evaluating our approach on CNNs and on other types of discriminative sequence training for D(B)LSTMs and DNNs; 3) developing even better parallel training approaches.
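For readers who wish to experiment with the block-level update, the following is a minimal NumPy sketch of blockwise model-update filtering as discussed above, covering both the CBM and NBM variants. It is an illustration under stated assumptions, not the implementation used in our experiments: the function name bmuf_block_update, the parameter names block_momentum and block_lr, and the use of simple model averaging as the intra-block parallel optimization step are all assumed for the sake of the example.

import numpy as np

def bmuf_block_update(w_prev, worker_models, delta_prev,
                      block_momentum=0.9, block_lr=1.0, nesterov=True):
    # Intra-block parallel optimization: here, simple averaging of the
    # worker models trained on this data block (an assumption for
    # illustration; any intra-block optimizer that produces an
    # aggregated model could be substituted).
    w_avg = np.mean(worker_models, axis=0)
    # Model update generated by this data block.
    g = w_avg - w_prev
    # Blockwise model-update filtering with block momentum and block
    # learning rate.
    delta = block_momentum * delta_prev + block_lr * g
    w_new = w_prev + delta
    # NBM starts the next block from a momentum look-ahead model,
    # whereas CBM starts from the updated global model directly.
    w_init_next = w_new + block_momentum * delta if nesterov else w_new
    return w_new, w_init_next, delta

# Toy usage: four "workers" and a 10-dimensional parameter vector.
w = np.zeros(10)
delta = np.zeros_like(w)
workers = [w + 0.01 * np.random.randn(10) for _ in range(4)]
w, w_next_init, delta = bmuf_block_update(w, workers, delta)

In this sketch, switching nesterov between True and False is what distinguishes the NBM and CBM configurations compared in Tables 3 and 4.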