In this paper, we study and analyze the memory reference of deblocking filter in H.264/AVC baseline decoder based on SimpleScalar/ARM simulator. The simulation result shows that the memory reference is known to be very time consuming in this new video coding standard. In order to reduce the memory reference and thus improve overall system performance, we propose a vertical processing order with efficient VLSI architecture which simultaneously processes the horizontal filtering of vertical edge and vertical filtering of horizontal edge. As a result, the memory performance of the proposed architecture is improved by 4.4 times when compared to software implementation. Moreover, the system performance of our proposal is 129% faster than the advanced architecture of previous proposal.