Abstract—Demand for high-performance deep learning (DL) inference in software applications is growing rapidly. DL workloads run on myriad platforms, including general-purpose processors (CPUs), systems-on-chip (SoCs) with accelerators, graphics processing units (GPUs), and neural processing unit (NPU) add-in cards. DL software engineers typically must choose between relatively slow general hardware (e.g., CPUs, SoCs) or relatively expensive, large, power-hungry hardware (e.g., GPUs, NPUs). This paper describes Centaur Technology's Ncore, the industry's first high-performance DL coprocessor technology integrated into an x86 SoC with server-class CPUs. Ncore's 4096-byte-wide SIMD architecture supports INT8, UINT8, INT16, and BF16 datatypes, with 20 tera-operations-per-second compute capability. Ncore shares the SoC ring bus for low-latency communication and work sharing with eight 64-bit x86 cores, offering flexible support for new and evolving models. The x86 SoC platform can further scale out performance via multiple sockets, systems, or third-party PCIe accelerators. Ncore's software stack automatically converts quantized models for Ncore consumption and leverages existing DL frameworks. In MLPerf's Inference v0.5 closed division benchmarks, Ncore achieves 1218 IPS throughput and 1.05 ms latency on ResNet-50-v1.5 and achieves the lowest latency of all MobileNet-v1 submissions (329 μs). Ncore yields a 23x speedup over the other x86 vendor's per-core throughput, while freeing its own x86 cores for other work. Ncore is the only integrated solution among the memory-intensive neural machine translation (NMT) submissions.