We have proposed Light-OPU, an FPGA-based overlay processor that accelerates a variety of lightweight CNNs (LW-CNNs). Light-OPU performs two levels of optimization: (1) software-level network reformulation, including layer grouping, operation fusion, and operation reordering, which eliminates redundant memory access and reduces the number of operations in LW-CNNs; and (2) a hardware-level micro-architecture specifically designed for LW-CNN operations. This micro-architecture can also compute conventional convolutional layers, since it retains all hardware features for conventional CNNs, such as those from [36]. The flexible acceleration engine guarantees high run-time resource efficiency, which in turn yields low latency and high power efficiency. Light-OPU achieves 5.5× lower latency and 3.0× better power efficiency than the edge-oriented GPU Jetson TX2, and obtains 1.39× to 8× higher throughput per DSP and 5× to 8.4× better power efficiency than recent FPGA accelerators for LW-CNNs. Moreover, Light-OPU is fully software programmable: no FPGA reconfiguration is required when switching networks or applications, whereas existing FPGA accelerators are all designed for specific LW-CNNs.
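
As a concrete illustration of the operation-fusion idea mentioned above (a generic sketch, not Light-OPU's actual implementation), the Python snippet below folds a batch-normalization layer into the weights and bias of the preceding convolution, so the fused layer requires a single pass over the feature map; the function and parameter names are hypothetical.

    import numpy as np

    def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
        """Fold a BatchNorm layer into the preceding convolution.

        w: conv weights, shape (out_ch, in_ch, kh, kw)
        b: conv bias, shape (out_ch,) or None
        gamma, beta, mean, var: per-channel BatchNorm parameters, shape (out_ch,)
        Returns (w_fused, b_fused) such that
        conv(x, w_fused) + b_fused == bn(conv(x, w) + b).
        """
        if b is None:
            b = np.zeros(w.shape[0], dtype=w.dtype)
        scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
        w_fused = w * scale[:, None, None, None]    # scale each output filter
        b_fused = (b - mean) * scale + beta         # fold mean shift into bias
        return w_fused, b_fused

    # Minimal usage example with random parameters (hypothetical shapes)
    w = np.random.randn(8, 3, 3, 3).astype(np.float32)
    gamma = np.random.rand(8).astype(np.float32)
    beta = np.random.rand(8).astype(np.float32)
    mean = np.random.rand(8).astype(np.float32)
    var = np.random.rand(8).astype(np.float32) + 0.1
    w_fused, b_fused = fuse_conv_bn(w, None, gamma, beta, mean, var)

Fusion of this kind removes one intermediate read-write of the whole feature map per fused pair, which is one way the memory-access savings described above can be realized in practice.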