This repository contains additional information on the analyses in our COLING 2018 survey "Multimodal Grounding for Language Processing".
Abstract: This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding for a variety of language processing tasks and the challenges that arise. We particularly focus on multimodal grounding of verbs, which play a crucial role in multimodal compositionality.
Please use the following citation:
@InProceedings{beinborn2018multimodal,
  title = {{Multimodal Grounding for Language Processing}},
  author = {Beinborn, Lisa and Botschen, Teresa and Gurevych, Iryna},
  booktitle = {Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics: Technical Papers},
  publisher = {Association for Computational Linguistics},
  pages = {to appear},
  month = {aug},
  year = {2018},
  location = {Santa Fe, USA},
}
Contact person: Teresa Botschen (botschen@aiphes.tu-darmstadt.de), Lisa Beinborn (lisa.beinborn@uni-due.de)
https://www.ukp.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
We present first steps towards an investigation of verb grounding and analyze the quality of verb representations in the most common publicly available approaches for multimodal representations.
This project uses different pretrained word embeddings which can be found here: https://fileserver.ukp.informatik.tu-darmstadt.de/coling18-multimodalSurvey
Run the script 'coling18-multimodalSurvey_experiments.py' to reproduce, step by step, the results reported in the paper. Use the datasets and embeddings listed below.
We used the following resources to create multimodal representations.
Google dataset with existing image embeddings as provided by Kiela et al. (2016) (paper: https://aclweb.org/anthology/D/D16/D16-1043.pdf, data: http://www.cl.cam.ac.uk/~dk427/cnnexpts.html). The images in the Google dataset were obtained via Google image search; the embeddings were computed with GoogLeNet.
visual Google representations: 1024 dim
imSitu dataset by Yatskar et al. (2016) (paper: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Yatskar_Situation_Recognition_Visual_CVPR_2016_paper.pdf, data: http://imsitu.org/)
--> We obtained the embeddings by applying a pre-trained VGG19 neural network for image classification (Simonyan and Zisserman (2014), paper: https://arxiv.org/pdf/1409.1556.pdf) to the visual resources.
visual imSitu representations: 4096 dim
GloVe embeddings by Pennington et al. (2014) (paper: http://aclweb.org/anthology/D14-1162)
GloVe representations: 300 dim
We mapped textual embeddings to the visual space by applying the mapping method of Collell et al. (2017) (paper: http://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14811/14042)
Mapped embeddings for Google dataset: 1024 dim
Mapped embeddings for imSitu dataset: 4096 dim
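The text-to-vision mapping can be sketched as follows. This is a simplified stand-in, assuming a plain linear map fitted with least squares; Collell et al. (2017) train linear and neural-network mappings with gradient descent, and the helper names and toy dimensions here are illustrative only.

```python
# Hedged sketch: learn a linear map W from textual to visual embedding space,
# here via least squares (a simplification of the Collell et al. setup).
import numpy as np

def learn_linear_mapping(X_text, Y_visual):
    """Fit W minimizing ||X_aug W - Y_visual||^2, with a bias column appended."""
    X_aug = np.hstack([X_text, np.ones((X_text.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X_aug, Y_visual, rcond=None)
    return W

def map_to_visual(X_text, W):
    """Predict visual embeddings for new textual embeddings."""
    X_aug = np.hstack([X_text, np.ones((X_text.shape[0], 1))])
    return X_aug @ W

# Toy example: map 300-dim text vectors to a 1024-dim visual space
# (matching the GloVe and Google-dataset dimensionalities above).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))
W_true = rng.normal(size=(301, 1024))
Y = np.hstack([X, np.ones((50, 1))]) @ W_true   # synthetic "visual" targets
W = learn_linear_mapping(X, Y)
pred = map_to_visual(X, W)                      # shape: (50, 1024)
```

At prediction time, the learned map produces a "visually grounded" vector for any word that has a textual embedding, including words with no images in the dataset.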
In line with previous work, the quality of the representations is evaluated as the Spearman correlation between the cosine similarity of two verb embeddings and the corresponding human similarity rating in the SimVerb dataset. The comparison covers 3,498 verb pairs.
SimVerb dataset by Gerz et al. (2016) (paper: http://www.aclweb.org/anthology/D16-1235)
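The evaluation protocol above can be sketched in a few lines. The embedding lookup and the toy vectors below are placeholders, not the actual GloVe or SimVerb data.

```python
# Minimal sketch of the evaluation: Spearman correlation between cosine
# similarities of verb-pair embeddings and human similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(pairs, ratings, embeddings):
    """pairs: list of (verb1, verb2); ratings: human similarity scores
    (same order); embeddings: dict verb -> vector. Returns Spearman's rho."""
    sims = [cosine(embeddings[v1], embeddings[v2]) for v1, v2 in pairs]
    rho, _ = spearmanr(sims, ratings)
    return rho

# Toy usage with made-up 2-dim vectors:
emb = {'run': np.array([1.0, 0.2]),
       'jog': np.array([0.9, 0.3]),
       'know': np.array([-0.5, 1.0])}
rho = evaluate([('run', 'jog'), ('run', 'know')], [9.0, 1.5], emb)
```

Because Spearman correlation only compares rankings, the absolute scale of the cosine similarities does not matter; only their ordering relative to the human ratings does.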
From a multimodal perspective, verbs can be categorized according to their degree of embodiment. This measure indicates to which extent a verb's meaning involves bodily experience. We obtained embodiment ratings for 1,163 pairs. The class 'high embodiment' contains pairs such as 'fall-dive', in which the embodiment ratings of both verbs fall into the highest quartile (135 pairs); the class 'low embodiment' contains pairs such as 'know-decide', with both ratings in the lowest quartile (81 pairs).
Embodiment ratings for verbs by Sidhu et al. (2014) (paper: http://iranarze.ir/wp-content/uploads/2017/08/7298-English-IranArze.pdf)
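The quartile-based split described above can be sketched as follows. The ratings in the example are illustrative values, not the actual Sidhu et al. (2014) norms, and the function name is our own.

```python
# Hedged sketch: a pair is 'high embodiment' if both verbs' ratings fall in
# the top quartile, 'low embodiment' if both fall in the bottom quartile.
import numpy as np

def split_pairs(pairs, embodiment):
    """pairs: list of (verb1, verb2); embodiment: dict verb -> rating."""
    scores = np.array(list(embodiment.values()))
    q1, q3 = np.percentile(scores, 25), np.percentile(scores, 75)
    high = [(a, b) for a, b in pairs
            if embodiment[a] >= q3 and embodiment[b] >= q3]
    low = [(a, b) for a, b in pairs
           if embodiment[a] <= q1 and embodiment[b] <= q1]
    return high, low

# Toy usage with made-up ratings:
ratings = {'fall': 6.0, 'dive': 5.8, 'run': 5.0, 'walk': 4.0,
           'read': 3.0, 'think': 2.0, 'decide': 1.5, 'know': 1.2}
high, low = split_pairs([('fall', 'dive'), ('know', 'decide'),
                         ('walk', 'run')], ratings)
```

Note that pairs with mixed ratings (e.g. 'walk-run' in the toy data) belong to neither class, which is why the high and low classes together cover only a subset of the 1,163 rated pairs.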