Focusing on travel videos taken in uncontrolled environments by amateur photographers, we exploit the correlation between different modalities to facilitate effective travel video scene detection. Scenes in travel photos, i.e., content taken at the same scenic spot, can be easily determined by examining time information. For a travel video, we extract several keyframes from each video shot. The photos and the keyframes are then each represented as a sequence of visual word histograms. Based on this representation, we transform scene detection into a sequence matching problem. After finding the best alignment between the two sequences, we determine scene boundaries in the video with the help of the scene boundaries in the photos. We demonstrate that the proposed method achieves an average purity of 0.95 when combined with conventional methods. We show that both visual word features and cross-media correlation aid scene detection.
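The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the histogram distance nor the alignment algorithm, so the sketch assumes an L1 distance and dynamic time warping, and it assumes photo scenes are split by a fixed time-gap threshold. All function names, the threshold, and the toy histograms are hypothetical.

```python
from typing import List, Sequence, Tuple

Histogram = Sequence[float]  # one visual word histogram per photo or keyframe


def photo_scene_labels(timestamps: Sequence[float], max_gap: float) -> List[int]:
    """Label time-sorted photos; a gap larger than max_gap starts a new scene.
    (The fixed threshold is an assumption, not taken from the paper.)"""
    labels: List[int] = []
    scene = 0
    for i, t in enumerate(timestamps):
        if i > 0 and t - timestamps[i - 1] > max_gap:
            scene += 1
        labels.append(scene)
    return labels


def l1_distance(a: Histogram, b: Histogram) -> float:
    """L1 distance between two histograms (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def align(photos: Sequence[Histogram],
          keyframes: Sequence[Histogram]) -> List[Tuple[int, int]]:
    """Align two histogram sequences with dynamic time warping (assumed
    algorithm) and return the (photo index, keyframe index) pairs."""
    n, m = len(photos), len(keyframes)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = l1_distance(photos[i - 1], keyframes[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences to recover the alignment path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def keyframe_scene_labels(path: Sequence[Tuple[int, int]],
                          photo_labels: Sequence[int],
                          n_keyframes: int) -> List[int]:
    """Give each keyframe the scene label of its aligned photo. If several
    photos align to one keyframe, the last one on the path wins."""
    labels = [0] * n_keyframes
    for photo_idx, keyframe_idx in path:
        labels[keyframe_idx] = photo_labels[photo_idx]
    return labels


def scene_boundaries(labels: Sequence[int]) -> List[int]:
    """Indices of keyframes at which a new scene starts."""
    return [k for k in range(1, len(labels)) if labels[k] != labels[k - 1]]


# Toy data: 3-bin visual word histograms for 4 photos and 5 keyframes.
photos = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
photo_times = [0.0, 60.0, 4000.0, 4050.0]  # seconds
keyframes = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.9, 0.1, 0.0],
             [0.0, 0.0, 1.0], [0.1, 0.0, 0.9]]

photo_labels = photo_scene_labels(photo_times, max_gap=1800.0)
path = align(photos, keyframes)
kf_labels = keyframe_scene_labels(path, photo_labels, len(keyframes))
print(kf_labels)                    # [0, 0, 0, 1, 1]
print(scene_boundaries(kf_labels))  # [3]
```

In this toy case the first three keyframes align with the photos of the first scene and the last two with the photos of the second, so the video scene boundary falls at keyframe 3. Mapping keyframe boundaries back to shot boundaries would additionally require the keyframe-to-shot assignment, which is omitted here.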