We propose an image-text alignment framework that matches images with text, taking blog article summarization as the main application. Objects in an image are first detected; deep features are then extracted from them and transformed into a space shared with the text. In parallel, the sentences of a blog article are represented as vectors and embedded into the same common space, enabling cross-modal matching. A blog article is then summarized as a set of images paired with their matched sentences. In the evaluation, we demonstrate the effectiveness of the proposed method and show that the generated summaries are more coherent.
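The core matching step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the projection matrices would be learned (here they are random stand-ins), and the dimensions, object detector outputs, and sentence vectors are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: detected-object features, sentence vectors,
# and the shared embedding space.
IMG_DIM, TXT_DIM, SHARED_DIM = 2048, 300, 128

# Random stand-ins for projection matrices that would be learned jointly.
W_img = rng.normal(size=(IMG_DIM, SHARED_DIM))
W_txt = rng.normal(size=(TXT_DIM, SHARED_DIM))

def embed(x, W):
    """Project features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy inputs: features of 3 detected objects and 5 article sentences.
obj_feats = rng.normal(size=(3, IMG_DIM))
sent_vecs = rng.normal(size=(5, TXT_DIM))

obj_emb = embed(obj_feats, W_img)
sent_emb = embed(sent_vecs, W_txt)

# Cosine similarity in the shared space; each object (and hence each
# image region) is matched to its most similar sentence.
sim = obj_emb @ sent_emb.T      # shape (3, 5)
matches = sim.argmax(axis=1)    # best sentence index per object
```

Pairing each image with its highest-scoring sentences in this way yields the image-plus-sentence summary the abstract describes.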