
Story Semantic Relationships from Multimodal Cognitions

Vishal Anand¹, Raksha Ramesh¹, Ziyin Wang¹, Yijing Feng², Jiana Feng¹, Wenfeng Lyu¹, Tianle Zhu¹, Serena Yuan¹, and Ching-Yung Lin¹,²

{va2361, rn2486, zw2605, jf3283, wl2733, tz2434, sy2657, c.lin}@columbia.edu; {yijing, cylin}@graphen.ai

¹ Columbia University, New York, NY, USA

² Graphen AI, New York, NY, USA

ABSTRACT

We consider the problem of building semantic relationships of unseen entities from free-form multi-modal sources. This intelligent agent understands semantic properties by (1) creating logical segments from sources, (2) finding interacting objects, and (3) inferring their interaction actions using (4) extracted textual, auditory, visual, and tonal information. Conversational dialogue discourses are automatically mapped to interacting co-located objects and fused with their kinetic action embeddings at each scene of occurrence. This generates a combined probability-distribution representation for interacting entities spanning every semantic relation class. Using these probabilities, we create knowledge graphs capable of answering semantic queries and of inferring missing properties in a given context.

CCS CONCEPTS

• Computing methodologies → Natural language processing; Information extraction; Lexical semantics; Activity recognition and understanding; Discourse, dialogue and pragmatics; Knowledge representation and reasoning; Scene understanding.

KEYWORDS

natural language processing, information extraction, lexical semantics, video understanding, speaker identification, video to text

ACM Reference Format:

Vishal Anand, Raksha Ramesh, Ziyin Wang, Yijing Feng, Jiana Feng, Wenfeng Lyu, Tianle Zhu, Serena Yuan, and Ching-Yung Lin. 2020. Story Semantic Relationships from Multimodal Cognitions. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3394171.3416305

1 INTRODUCTION

With the growing popularity of common-sense inference, deep video understanding aims to automatically deduce relationships between entities in long-duration multi-modal inputs and to extract knowledge that addresses varied query types. With the surge of breakthroughs in text models [5], many tasks have started leveraging


transfer learning on textual data, and recent works now leverage video data for tasks such as event recognition, object detection, and activity recognition. However, a significant knowledge gap exists in the joint inference of multiple aspects of video properties like audio, transcribed conversation, and scenes. Key to this intelligent agent is the isolation, fusion, and analysis of multimodal features with sufficient annotations for training robust models. In our work, we take a holistic perspective on the semantic-understanding problem, considering all available modalities to infer hidden information and eventually construct knowledge graphs. By fusing different modalities, we gain a better understanding of entities, relations, and events within movies. In particular, we focus on incorporating reasoning from text and scene, with methodology that includes conversational dialog discourse, shot and scene detection, object detection and mapping, and face detection.

2 BACKGROUND AND RELATED WORK

Many approaches to video understanding adopt a question-answering prototype for evaluation. Knowledge graphs are known for capturing both concepts and their pairwise relationships, and they have been applied successfully to machine-learning applications including Web search and social media [6]. Knowledge-graph construction considers three generalized tasks: 1) knowledge extraction, 2) entity mapping, and 3) data integration. Work on multimodal approaches involving knowledge graphs [8] bridges a gap in existing state-of-the-art approaches, as it unifies knowledge graphs and deep neural networks in an end-to-end learning framework by incorporating external knowledge into video classification. While we propose a multi-modal formulation, [8] is a single-modal approach that takes unstructured text as input and creates a knowledge graph with five components (Entity Mapping, Co-reference Resolution, Triple Extraction, Triple Integration, and Predicate Mapping). In the domain of social-relationship understanding, most existing studies focus on modelling relationships in still images using coarse-to-fine hierarchical categories [16]. [10] adopts a dual-glance model whose first glance makes a coarse prediction from objects and appearances while the second glance uses contextual cues.

[12] proposes a multi-scale spatio-temporal reasoning framework to capture visual relations between entities. The use of multimodal frameworks for relationship and interaction prediction is not widely explored, which we attempt to address.

3 METHODOLOGY

Figure 1: Architecture Schema for multi-modal training.

We place higher importance on context than on the interactions of individual entities in any single multi-modal source. First, we divide resource-rich video files into shots that represent a change of visual reference. Second, we find a contiguous narrative among the shot segments, i.e., a change in storyline is determined by when scenes change. This is captured by changes in lightweight video-based features and by maximizing the sum of shot-to-shot distance products. With scenes identified as contiguous sets of storylines, we extract each video's textual description, perform audio transcription on the extracted audio, and find diarized texts from individual entities in the scene.

Next, we extract faces in each frame, cluster them, and associate them with the provided ground truth for each movie; in parallel, we find each face's body frames. For each frame, we also extract common objects occurring in the scene to enrich the data. Each of these entities is then fed into a kinetics model, which emits the probable action being performed. All of these are used to train a unified model that associates the actions and co-location of entities to produce a set of probabilities for each pair of entities. We also run a human evaluation to contrast with the performance of our model and to comment on the hardness of the problem.

4 DATASET

The HLVU dataset (Table 1) has 10 open-source movies sampled from [4]. The training set includes four long and two short movies, while the testing set includes two long and two short movies. The dataset is annotated with relations between key characters, locations, objects, and action events, along with names and images of key entities. The objective is to learn 120 semantic relations between the entities in the dataset using multimodal inputs.

5 BUILDING BLOCKS

5.1 Shot & Scene detection

Figure 2: Scene Detection Process

We perform shot detection on all movies and identify key frames. The extracted frame features are grouped into scenes by similarity scores using sliding windows, as illustrated in Fig. 2. A scene consists of a sequence of adjacent shots that are semantically related and represent a story within a movie. We cluster multiple shots together and use shot embeddings to deduce whether they are temporally aligned. We analyze and extend the scene-detection work of [13] and adopt its dynamic-programming approach of minimizing a normalized cost function to group shots together; however, we found that HSV color-histogram features predicted scene boundaries far more accurately than deep features for movies.
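
A minimal sketch of the shot-similarity idea (not the full dynamic program of [13]): HSV color histograms serve as lightweight shot features, and a sliding window over neighboring shots flags a scene boundary when no nearby shot is similar enough. The `keyframes` input, window size, and threshold below are illustrative assumptions.

```python
import cv2

def hsv_histogram(frame_bgr, bins=(8, 8, 8)):
    """Normalized 3-D HSV histogram used as a lightweight shot feature."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins, [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def scene_boundaries(keyframes, window=3, threshold=0.5):
    """Indices of shots that likely open a new scene.

    keyframes: one representative BGR frame per shot (hypothetical input).
    """
    feats = [hsv_histogram(f) for f in keyframes]
    boundaries = []
    for i in range(window, len(feats)):
        # Correlation between the current shot and each shot in the preceding window.
        sims = [cv2.compareHist(feats[j], feats[i], cv2.HISTCMP_CORREL)
                for j in range(i - window, i)]
        if max(sims) < threshold:  # nothing nearby looks similar -> scene change
            boundaries.append(i)
    return boundaries
```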

5.2 Object detection and mapping

The entities provided in the HLVU dataset [4] include persons, objects, locations, and concepts, for which relevant images are provided for mapping. The location and object entities are localized within scenes using SIFT-based feature matching to handle varying scales and crops. Keypoints within a frame are computed and descriptors are extracted at each keypoint. The Euclidean distance metric is used to compute the best matches between a template and a frame.
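
As a hedged illustration of this matching step, the OpenCV sketch below localizes a provided entity image inside a frame with SIFT keypoints, L2 (Euclidean) descriptor matching, and Lowe's ratio test; the ratio and minimum-match threshold are assumptions rather than the paper's settings.

```python
import cv2

def template_in_frame(template_gray, frame_gray, ratio=0.75, min_matches=10):
    """Return True if the entity template plausibly appears in the frame."""
    sift = cv2.SIFT_create()
    _, des_t = sift.detectAndCompute(template_gray, None)
    _, des_f = sift.detectAndCompute(frame_gray, None)
    if des_t is None or des_f is None:
        return False
    # Brute-force matcher with Euclidean (L2) distance, two nearest neighbours each.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des_t, des_f, k=2)
    # Lowe's ratio test keeps only distinctive keypoint matches.
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) >= min_matches
```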

5.3 Face detection

Dlib's [9] face detector identifies faces in each video frame. The detector uses Histogram of Oriented Gradients (HOG) features combined with a linear classifier. With an image pyramid and a sliding-window detection scheme, it detects faces at various scales and locations. The detected faces are aligned using the 5 facial landmarks from dlib's facial landmark predictor.
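
A short dlib sketch of this detection-and-alignment step follows; the landmark-model filename is an assumption (dlib distributes it as a separate download).

```python
import dlib

detector = dlib.get_frontal_face_detector()  # HOG features + linear classifier
predictor_5pt = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")

def aligned_faces(rgb_frame, upsample=1):
    """Detect faces in one frame and return 150x150 chips aligned by 5 landmarks."""
    chips = []
    for rect in detector(rgb_frame, upsample):      # image pyramid + sliding window
        landmarks = predictor_5pt(rgb_frame, rect)  # 5 facial landmarks
        chips.append(dlib.get_face_chip(rgb_frame, landmarks, size=150))
    return chips
```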

5.4 Face-Entity Mapping

The aligned faces are mapped to 128-D vectors using dlib's ResNet model trained with triplet loss [15]. We also compute face embeddings for the profile images of important characters and identify faces by comparing them with these known faces. The unknown faces are first clustered using Chinese whispers [1]; then, for each face detected in the video, we predict the most similar person with K-nearest neighbors (KNN). Each face cluster is labeled with the majority label predicted by KNN, and clusters with too few members are labeled as 'unknown'.
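
The following sketch illustrates this mapping step under stated assumptions: the dlib model filename, the 0.5 Chinese-whispers threshold, the minimum cluster size, and the scikit-learn KNN standing in for whichever KNN implementation was actually used.

```python
import dlib
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

facerec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def embed(aligned_chips):
    """128-D embeddings for aligned 150x150 face chips."""
    return [facerec.compute_face_descriptor(chip) for chip in aligned_chips]

def label_face_clusters(video_descriptors, profile_embeddings, profile_names,
                        min_cluster_size=5, n_neighbors=3):
    """Cluster unknown faces, then give each cluster the majority KNN label."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(np.array(profile_embeddings), profile_names)
    clusters = dlib.chinese_whispers_clustering(video_descriptors, 0.5)
    labels = {}
    for cid in set(clusters):
        members = [np.array(d) for d, c in zip(video_descriptors, clusters) if c == cid]
        if len(members) < min_cluster_size:
            labels[cid] = "unknown"  # clusters with too few faces stay unknown
            continue
        votes = knn.predict(np.array(members))
        labels[cid] = Counter(votes).most_common(1)[0][0]
    return labels
```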

5.5 Scene description

We generate scene descriptions at shot level based on [18]. Frame-wise RGB and optical-flow features are pooled with region proposals extracted from a Faster R-CNN detector. The network is trained on ActivityNet [2], and we observed that the generated descriptions do not accurately capture background objects and scenes. This suggests the need to fine-tune an off-the-shelf object detector on a benchmark movie dataset to generate richer descriptions for movies.

5.6 Audio transcription

We use CMU Sphinx to extract text corresponding to all sound utterances by the cast of the given movies. We then cross-verify the initial results against the Google API for each sound utterance's time-stamp. The quality of neither process was very satisfactory.
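
A minimal sketch of this two-engine transcription, using the SpeechRecognition package as a convenient wrapper for both pocketsphinx and the Google Web Speech API (the wrapper itself is our assumption, not necessarily the paper's exact tooling):

```python
import speech_recognition as sr

def transcribe(wav_path):
    """Return (sphinx_text, google_text) for one utterance clip."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    sphinx_text = recognizer.recognize_sphinx(audio)       # offline CMU Sphinx
    try:
        google_text = recognizer.recognize_google(audio)   # online cross-check
    except (sr.RequestError, sr.UnknownValueError):
        google_text = None
    return sphinx_text, google_text
```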

5.7 Audio emotion embedding

Acoustic features like pitch, energy, frequency, and spectral coefficients such as Mel-frequency cepstral coefficients (MFCC) are known to be crucial for emotion recognition, yet they are under-explored in social relationship understanding. We take audio clips for each speaker at scene level and use the VGGish model [7] to extract a semantically compact 128-dimensional representation for each second of audio. These are used in the downstream relation-classification task: using the labelled relationships from the training data together with the audio embeddings, we train a supervised GRU network.
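
The paper only states that a supervised GRU is trained on scene-level VGGish sequences; the PyTorch sketch below shows one plausible shape for that classifier, with the hidden size and the use of the final hidden state as assumptions.

```python
import torch
import torch.nn as nn

class AudioRelationGRU(nn.Module):
    """GRU over per-second 128-D VGGish embeddings -> relation logits."""

    def __init__(self, num_relations=60, embed_dim=128, hidden=256):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_relations)

    def forward(self, vggish_seq):
        # vggish_seq: (batch, seconds, 128)
        _, h_n = self.gru(vggish_seq)           # final hidden state: (1, batch, hidden)
        return self.classifier(h_n.squeeze(0))  # logits per scene-level clip

logits = AudioRelationGRU()(torch.randn(2, 30, 128))  # two 30-second clips
```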

5.8 Speaker diarization & Face mapping

We extend CMU Sphinx to produce speaker-separated audio splits with time-stamps to help with speaker identification, but the output was not satisfactory for the movies. We later used the Google API with speaker identification and found similar results on our longer-duration audio files. Using the output of both systems, we apply scene-based face mapping to assign names to speakers: dlib's 68-facial-landmark predictor captures the shape of the lips, and we estimate lip motion from the relative vertical shape change with respect to face size. We maintain a running average over frames to predict whether the target person is the speaker.
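
A sketch of this lip-motion heuristic, assuming the inner-lip landmark indices 62/66 of dlib's 68-point model and an exponential running average with an illustrative smoothing factor:

```python
import dlib

predictor_68pt = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_openness(rgb_frame, face_rect):
    """Inner-lip vertical gap, normalised by the detected face height."""
    shape = predictor_68pt(rgb_frame, face_rect)
    top, bottom = shape.part(62), shape.part(66)
    return abs(bottom.y - top.y) / max(face_rect.height(), 1)

class LipMotionTracker:
    """Running average of lip motion per character; highest score ~ likely speaker."""

    def __init__(self, alpha=0.8):
        self.alpha = alpha
        self.prev, self.score = {}, {}

    def update(self, name, openness):
        motion = abs(openness - self.prev.get(name, openness))
        self.prev[name] = openness
        self.score[name] = self.alpha * self.score.get(name, 0.0) + (1 - self.alpha) * motion
        return self.score[name]
```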

5.9 Knowledge Graph relational queries

We construct the knowledge graph to represent what the system has learnt from the movie: vertices represent entities, including people, locations, concepts, and organizations, and edges represent relations between two entities. The vertices and edges, together with the confidence of each relation prediction, are ingested into the graph database on Graphen's Ardi platform. We also use Ardi's Graph Analytics module to traverse the graph, retrieve relations given a set of conditions, and get all possible paths between two entities.
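
Since Ardi is a proprietary platform, the networkx sketch below only mirrors the idea: entities as vertices, predicted relations with confidences as edge attributes, and relation/path queries. The relation names and confidence values are illustrative, not predictions from our system.

```python
import networkx as nx

kg = nx.MultiDiGraph()
for name in ("Isaac", "Jaden", "Mrs Milton"):
    kg.add_node(name, kind="person")
# Illustrative edges: relation labels and confidences are made up for the example.
kg.add_edge("Isaac", "Mrs Milton", relation="child_of", confidence=0.74)
kg.add_edge("Jaden", "Mrs Milton", relation="child_of", confidence=0.61)

def relations_of(graph, entity, min_confidence=0.5):
    """Sufficiently confident relations attached to one entity."""
    return [(u, v, d) for u, v, d in graph.edges(entity, data=True)
            if d["confidence"] >= min_confidence]

def paths_between(graph, source, target, cutoff=4):
    """Candidate entity paths for a Type 1-style query."""
    return list(nx.all_simple_paths(graph.to_undirected(), source, target, cutoff=cutoff))
```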

6 EXPERIMENTAL SETUP

6.1 Modality: Scene & Kinetics

To predict a relationship between a pair of entities, we first co-locate and extract individual scene-level tracks for all entities that occur in a scene. A track contains cropped frames of the person and object entities and is temporally aligned with the actual scene. Since clothing and activity are important semantic attributes for modelling social relationships in videos [16], we extract bounding boxes for a character's body regions using Single Shot Detection (SSD) [11]. The bounding boxes for the character's body regions are localized based on the maximum intersection over union (IoU) with the recognized faces. The scene model uses three parallel video streams: tracks for the pair of entities and the scene as a whole. Features are extracted with the I3D spatio-temporal convolutional network [3] used for activity recognition. We experimented with different durations of video clips to co-locate entities in a scene and found a 300-frame margin to be optimal. The features extracted from the video streams are concatenated and fed to a three-layer MLP trained to predict the sixty relationship categories.
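
A sketch of the fusion head under stated assumptions (a 1024-D pooled I3D feature per stream and a hidden width of 512; the text above specifies only three streams, concatenation, and a three-layer MLP over sixty classes):

```python
import torch
import torch.nn as nn

class SceneKineticsFusion(nn.Module):
    """Concatenate I3D features of two entity tracks plus the scene, classify relations."""

    def __init__(self, i3d_dim=1024, hidden=512, num_relations=60):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * i3d_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_relations),
        )

    def forward(self, track_a, track_b, scene):
        # Each input: (batch, i3d_dim) pooled I3D features for one video stream.
        return self.mlp(torch.cat([track_a, track_b, scene], dim=1))

logits = SceneKineticsFusion()(torch.randn(4, 1024), torch.randn(4, 1024), torch.randn(4, 1024))
```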

6.2 Modality: Text

Our text model extends DialogRE [17], which is based on BERT, and we extract relationships between speakers from dialogues. We fine-tune the model and extend its smaller set of relations to 60 relational categories with fully connected layers. We use our speaker-diarization output at scene level and apply data augmentation by automatically mapping the original 36 relation categories to our 60 relational categories, finding neighbors through their embeddings.
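
This is not the full DialogRE [17] architecture, but a minimal Hugging Face sketch of the classification shape we describe: BERT over scene-level diarized dialogue with a fully connected layer producing logits over 60 relation classes. The speaker strings in the usage line are hypothetical.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DialogueRelationClassifier(nn.Module):
    def __init__(self, num_relations=60):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, num_relations)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)  # logits over the 60 relation classes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer("Speaker 1: Dad, are you coming home? Speaker 2: Yes, dear.",
                  return_tensors="pt", truncation=True)
logits = DialogueRelationClassifier()(batch["input_ids"], batch["attention_mask"])
```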

7 EVALUATION AND RESULTS

The ACM Grand Challenge is based on three different question types on the HLVU dataset [4]. Type 1 requires us to find all valid paths from a given source to a target, for which only one correct solution exists; therefore, F1 scores are chosen as the evaluation metric. A path is considered correct only if all the relations and entities along the path match the ground truth. Type 2 is fill-in-the-graph-space, where a list of entities and their relations to an unknown entity is given. The answer to this question is a list of potential entities in descending order of prediction confidence. There are in total 60 symmetric relationships to infer from, and Mean Reciprocal Rank (MRR) is the evaluation metric. Type 3, multiple-choice question answering, is evaluated by accuracy.¹ Mean Reciprocal Rank captures more retrieval information than the F1 and accuracy metrics used for Type 1 and Type 3 respectively, which require single solutions.
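
For concreteness, a small sketch of Mean Reciprocal Rank over Type 2-style ranked answers (the entity names are hypothetical):

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """ranked_lists: one ranked entity list per query; gold_answers: correct entities per query."""
    total = 0.0
    for ranking, answers in zip(ranked_lists, gold_answers):
        for rank, entity in enumerate(ranking, start=1):
            if entity in answers:
                total += 1.0 / rank  # reciprocal rank of the first correct hit
                break
    return total / len(ranked_lists)

# Correct entity ranked 1st in one query and 3rd in another -> (1 + 1/3) / 2 ≈ 0.67
print(mean_reciprocal_rank([["Anna", "Bob"], ["Bob", "Cleo", "Anna"]],
                           [{"Anna"}, {"Anna"}]))
```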

¹ For test movies' F-1 scores in Table 1, we use Path-1 F1 values. For each movie in the training set, we learn from each modality on the 5 other training movies and evaluate on the remaining one. For each test movie, we train on all movies in the training dataset.

| Movie | #Actor | #Speaker | #Object | Time | Text Type1 | Text Type2 | Text Type3 | Text+Scene Type1 | Text+Scene Type2 | Text+Scene Type3 | Human Type1 | Human Type2 | Human Type3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Honey | 10 | 10 | 12 | 86 min | 0 | 41.7 | 0.1 | 25.0 | 16.7 | 0 | - | - | - |
| Nuclear Family | 4 | 4 | 5 | 28 min | 0 | 37.5 | 0 | 0.0 | 100.0 | 0 | - | - | - |
| Spiritual Contact | 10 | 10 | 13 | 66 min | 0 | 37.5 | 0 | 0.0 | 52.1 | 0 | 100.0 | 22.2 | - |
| Super Hero | 7 | 7 | 12 | 18 min | 0 | 0.0 | 0 | 25.0 | 16.7 | 0 | - | - | - |
| Huckleberry Finn | 10 | 10 | 20 | 106 min | 0 | 0.0 | 0 | 25.0 | 6.3 | 0 | 100.0 | 87.5 | 51.8 |
| Valkaama | 7 | 7 | 13 | 93 min | 0 | 25.0 | 0 | 25.0 | 58.3 | 0 | 100.0 | 100.0 | 50.0 |
| Shooters | 8 | 8 | 11 | 41 min | - | - | - | 1.2 | 15.9 | 50.0 | - | - | - |
| Let's Bring Back Sophie | 13 | 13 | 22 | 50 min | - | - | - | 0.0 | 16.7 | 50.0 | - | - | - |
| The Big Something | 9 | 9 | 12 | 101 min | - | - | - | 0.0 | 0.0 | 50.0 | - | - | - |
| Time Expired | 16 | 16 | 36 | 92 min | - | - | - | 0.0 | 0.0 | 50.0 | - | - | - |

Table 1: Evaluation in percentages for six Train and four Test movies; Type 1 finds paths between entities (F1), Type 2 fills in missing graph information (MRR), and Type 3 finds semantic relations between nodes (Accuracy)

7.1 Human Ground Truth Evaluation

All training movies are watched by a human at least twice and annotated manually in order to create queries and ground truth for model evaluation². Even so, we rarely see performance in the range of 80% MRR for Type 2, and we see poor results for Type 3 questions, mostly owing to there being too many semantic relation classes for a worker to process.

In human evaluation, workers use background music to better understand a situation's mood, and they grasp conversational context and bodily gestures more readily. Humans also have implicit access to external datasets. Recent approaches focus on data representations, but making deductions from multimodal free-form inputs is harder.

7.2 Result Analysis

Our system performs significantly better on the Type 2 inference task (Table 1). For Type 1, the paths and entities must match the ground truth in order; this leads to weaker performance whenever any entity pair's relation does not match the ground truth. Since the text model has soft transfer learning on pre-trained BERT, we extract a good dialogue representation at scene level. For Type 1 questions, the fused model performs better, while its low recall is attributed to information loss from mismatched or non-identifiable objects in face and object recognition. Imbalanced relationship categories in the training data posed a key challenge, leading to significant bias. We evaluate the test set using the fused model only.

Type 1 questions were deemed hardest by the challenge authors. Our model performs better on the test set when the paths are evaluated on relations grouped into five sets. For Shooters, the recall of 2.8% is higher than the precision of 0.075%, implying there are fewer false negatives than false positives. We can reduce false positives across movies by encoding feature attributes like age and gender, which help eliminate knowledge-graph edges between two entities for which a relation does not logically apply. Low recall is also attributed to:

1) We do not incorporate inverse relations, i.e., relations outside of the primary sixty categories. The paths between Robin's father and Nicole in Let's Bring Back Sophie contain multiple inverse relations like Socialized At By.

2) When an important entity is not recognized in the pipeline, the body tracks for the kinetics model become sparse or non-existent, and the model misses inferring their paths. This is evident in Shooters, where Mrs Milton should connect edges in multiple paths from Isaac and Jaden.

3) When two entities are not co-located or are tracked rarely, our model either does not infer a relation between them or does so with a lower confidence, leading to missed edges. For example, most paths between Emil Oryx and Sasha in Time Expired should contain Corinna Zimmerman, who is not co-located with entities in the suggested path. One possible way to reduce missed co-located entities is to increase the frame threshold in our model and introduce more hyperparameters.

² The evaluation files on the training set are hand-crafted by workers for the training movies to best capture performance, as opposed to recording vanilla model losses and accuracies.

For Type 2 ranking questions, our model retrieves target entities from the knowledge graph based on the properties of associated edges for multiple queries. However, Type 2 suffers from the same problems described for Type 1. For longer movies, the model generates too many sets of probabilities, one for each scene, which hinders the entity-pair distribution from converging optimally. This explains the MRR scores for The Big Something and Time Expired.

For Type 3 questions, our model achieves 100% accuracy on questions resembling "How many children/siblings does A have?" across all test movies. These comprise 50% of the Type 3 questions, so we obtain an accuracy of 50% or more on every movie.

8 CONCLUSION AND FUTURE WORK

We found that segmenting storylines helped a lot, that text-based embeddings are relatively easy to adapt despite the scarcity of training samples in our movie scenes, and that audio-emotion embeddings were not effective at detecting critical moments in a movie. Scene-based kinetics were very effective at producing large numbers of training samples and helped create a powerful model. We can improve our character-entity co-location pipeline by performing object tracking on each character's occurrence.

A future direction of research on semantic deduction could be based on crisper speaker diarization to prevent garbled transcription, on inferring morphological segregation of multilingual conversations [14], and on reducing false-positive speaker associations by using head tracking as a better proxy for person-face mapping during semantically deduced sample creation, since freely occurring multimodal data rarely have faces oriented towards the recording device.

REFERENCES

[1] Chris Biemann. 2006. Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing. 73–80.

[2] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 961–970.

[3] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.

[4] Keith Curtis, George Awad, Shahzad Rajput, and Ian Soboroff. 2020. HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do. In Proceedings of the 2020 International Conference on Multimedia Retrieval. 355–361.

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423

[6] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 601–610.

[7] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 131–135.

[8] Natthawut Kertkeidkachorn and Ryutaro Ichise. 2017. T2KG: An end-to-end system for creating knowledge graph from unstructured text. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

[9] Davis E. King. 2009. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research 10 (2009), 1755–1758.

[10] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. 2017. Dual-Glance Model for Deciphering Social Relationships. arXiv:1708.00634 [cs.CV]

[11] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.

[12] X. Liu, W. Liu, M. Zhang, J. Chen, L. Gao, C. Yan, and T. Mei. 2019. Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3561–3569.

[13] Daniel Rotman, Dror Porat, Gal Ashour, and Udi Barzelay. 2018. Optimally grouped deep features using normalized cost for video scene detection. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. 187–195.

[14] Efsun Sarioglu Kayi, Vishal Anand, and Smaranda Muresan. 2020. MultiSeg: Parallel Data and Subword Information for Learning Bilingual Embeddings in Low Resource Scenarios. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL). European Language Resources Association, Marseille, France, 97–105. https://www.aclweb.org/anthology/2020.sltu-1.13

[15] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 815–823.

[16] Qianru Sun, Bernt Schiele, and Mario Fritz. 2017. A domain based approach to social relation recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3481–3490.

[17] Dian Yu, Kai Sun, Claire Cardie, and Dong Yu. 2020. Dialogue-Based Relation Extraction. arXiv preprint arXiv:2004.08056 (2020).

[18] Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, and Marcus Rohrbach. 2019. Grounded video description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6578–6587.