In this paper, we propose a data prefetching scheme, History-Awoken Stride (HAS) prefetching, optimized with a warp scheduler, Prefetched-Then-Executed (PTE), and evaluate its performance on a platform that we developed. Our platform is a single-instruction, multiple-thread (SIMT) GPGPU environment that supports the OpenCL 1.2 runtime and the TensorFlow framework through CUDA-on-CL technology. The enormous number of threads executing on a GPU places critical demands on memory performance. HAS exploits a history table of related memory accesses within a warp, across warps of the same workgroup, and across workgroups, and uses address strides and warp status to monitor the prefetching progress of the executing warp. PTE issues warps precisely according to the prefetching status reported by HAS. Experimental results for LeNet-5 inference and 11 PolyBench test programs on CAS-GPU show that our mechanism achieves an average IPC improvement of 10.4% and a 7.8% reduction in data cache miss rate. The prefetch accuracy can reach 67.7%, and the proportion of prefetch requests that arrive at the appropriate time reaches 48.2%.
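To illustrate the general stride-detection idea that a history-table prefetcher relies on, the following minimal Python sketch models a conventional stride table. It is not the HAS design: the names `StrideEntry` and `StridePrefetcher`, the keying by load instruction address (`pc`), the `degree` parameter, and the rule that a stride must repeat once before prefetching are all assumptions made for illustration. The intra-warp, inter-warp, and workgroup history and the PTE scheduling are omitted.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    """History kept for one load instruction (hypothetical layout)."""
    last_addr: int           # address of the most recent access
    stride: int = 0          # difference between the last two addresses
    confirmed: bool = False  # True once the same nonzero stride repeats


class StridePrefetcher:
    """Generic stride prefetcher; an illustrative model, not HAS itself."""

    def __init__(self, degree: int = 1):
        self.table = {}       # maps pc -> StrideEntry
        self.degree = degree  # number of addresses to prefetch ahead

    def access(self, pc: int, addr: int) -> list:
        """Record a demand access and return the addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First access from this instruction: only record it.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []

        stride = addr - entry.last_addr
        # The stride is trusted only if it is nonzero and matches the
        # stride observed on the previous access.
        entry.confirmed = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr

        if not entry.confirmed:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]
```

In this sketch, three accesses from the same instruction at addresses 100, 164, and 228 establish a stride of 64, so the third call returns the next `degree` addresses along that stride; an access that breaks the pattern returns an empty list until the stride repeats again.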