Existing video classification models (e.g., VideoBERT, TimeSformer) often treat movies as generic video streams, ignoring unique cinematic structures such as shot transitions, pacing, and narrative tropes. fills this gap by focusing specifically on scene-level classification —the atomic narrative unit typically comprising multiple shots unified by time and location.
[5] Devlin, J., et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL .
[4] Zhou, B., et al. (2018). Movie genre classification via scene categorization. ACM MM .
