Modular exponentiation in public-key cryptosystems is usually carried out by repeated modular multiplication of large integers. A high-speed modular multiplier is therefore crucial for speeding up encryption and decryption. In this paper, we first explore how to relax the data dependency among multiplication, quotient determination, and modular reduction in the conventional Montgomery modular multiplication algorithm. We then propose a new modular reduction algorithm with a shorter critical path delay in hardware implementation. The speed improvement is achieved by reducing the critical path from a 4-to-2 to a 3-to-2 carry-save addition, and the overall time complexity is decreased by performing multiplication and modular reduction simultaneously. Experimental results show that our modular exponentiation design achieves both time and area-time (AT) advantages over existing work.
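
For orientation, the data dependency among multiplication, quotient determination, and modular reduction can be seen in a minimal software sketch of the conventional radix-2 Montgomery multiplication, together with a square-and-multiply exponentiation built on it. This is the textbook baseline under stated assumptions (odd modulus n, R = 2^k with k the bit length of n), not the modular reduction algorithm proposed in this paper; the names mont_mul and mont_pow are illustrative only.

```python
def mont_mul(a, b, n, k):
    """Return a * b * 2^(-k) mod n (conventional radix-2 Montgomery product).

    Assumes n is odd, n < 2^k, and 0 <= a, b < n.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # multiplication: add partial product a_i * b
        q = s & 1                 # quotient determination: q makes s + q*n even
        s = (s + q * n) >> 1      # modular reduction: add q*n, then divide by 2
    # The loop keeps s < 2n, so one conditional subtraction suffices.
    return s - n if s >= n else s


def mont_pow(base, exp, n):
    """Return base^exp mod n via left-to-right square-and-multiply (n odd, n > 1)."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)               # R^2 mod n, with R = 2^k
    x = mont_mul(base % n, r2, n, k)    # base in Montgomery form: base * R mod n
    acc = mont_mul(1, r2, n, k)         # 1 in Montgomery form: R mod n
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k)  # square
        if bit == "1":
            acc = mont_mul(acc, x, n, k)  # multiply
    return mont_mul(acc, 1, n, k)       # convert back from Montgomery form
```

In each loop iteration of mont_mul, the quotient bit q depends on the freshly updated sum, and the reduction depends on q; this serial chain is the dependency that a hardware design would need to relax in order to overlap the three steps.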