Abstract
This paper provides data resources for low-resource hate speech detection. Specifically, we introduce two data resources: (i) the HateBR 2.0 corpus, which is composed of 7,000 comments extracted from Brazilian politicians’ accounts on Instagram and manually annotated with a binary class (offensive versus non-offensive) and hate speech targets; it is an updated version of the HateBR corpus in which highly similar and one-word comments were replaced; and (ii) the multilingual offensive lexicon (MOL), which consists of 1,000 explicit and implicit terms and expressions annotated with context information. The lexicon also comprises native-speaker translations and their cultural adaptations in English, Spanish, French, German, and Turkish. Both the corpus and the lexicon were annotated by three different experts and achieved high inter-annotator agreement. Lastly, we implemented baseline experiments on the proposed data resources. The results demonstrate the reliability of the data, outperforming baseline dataset results in Portuguese and showing promising results for hate speech detection in different languages.
Publisher
Cambridge University Press (CUP)