Adversarial training can significantly improve a neural network's resistance to adversarial attacks, thereby improving model robustness. However, a major drawback of many existing adversarial training workflows is the computational cost and extra processing time introduced by data augmentation techniques. This post explores the application of embedding perturbations via the Fast Gradient Method (FGM) when finetuning large language models (LLMs) for short-text classification tasks. This adversarial training approach was evaluated on the first sub-task of SemEval 2023-Task 10, focused on the explainable detection of sexism in social networks (EDOS). Empirical results show that models adversarially finetuned with FGM took on average 25% longer to train and scored 0.2% higher F1 than their respective baselines.


Social media is ubiquitous in everyday communication and a place where users freely express their views and thoughts. However, while most users make fair use of social media platforms, the presence of detrimental and undesirable content that can be deemed abusive towards other users or communities has drawn a lot of attention lately. The threat this poses to online safety has sparked content moderation strategies different from those traditionally used to fight spam or phishing, since in this case moderators must deal with legitimate users occasionally posting messages in toxic, sexist or abusive language rather than with financially motivated cybercrime.

The use of Natural Language Processing (NLP) to detect and assess sexist content at scale is widespread; however, this is far from being a solved problem. The lack of fine-grained classification and poor interpretability have been highlighted as current shortcomings for this task. The Explainable Detection of Online Sexism (EDOS) workshop, organized as part of SemEval 2023, aims to bridge the above-mentioned gaps in sexist language detection for English.

In addition to the identified challenges, there are other nuances common to many language-based moderation systems. Social media posts can be short and open to interpretation without sufficient context. Likewise, the presence of slang, spelling variations and purposely obfuscated content can lower the confidence of NLP tools (Mosquera and Moreda, 2012) and evade detection (Mosquera, 2022a). Finally, labeling large training sets can be expensive, since it requires a pool of annotators with expert knowledge (Vidgen and Derczynski, 2021). Training sets with only a few thousand positive examples and a high class imbalance are therefore not uncommon, which can affect model robustness.


The EDOS challenge provided a training dataset (Kirk et al., 2023) of 14,000 examples with fine-grained labels for sexist posts written in English, extracted from the Gab and Reddit social networks. Of these, only 3,398 entries were labeled as sexist, resulting in a substantial class imbalance for the proposed sub-task A: binary classification of sexist content.
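The counts above translate to roughly a 3:1 skew towards the negative class. A minimal sketch of that arithmetic, plus the kind of inverse-frequency class weights a weighted loss could use (the weighting scheme is an illustrative assumption, not necessarily what the system used):

```python
# Class imbalance in EDOS sub-task A, from the figures quoted above.
total, sexist = 14000, 3398
not_sexist = total - sexist

# Negative-to-positive ratio and inverse-frequency weights
# (total / (n_classes * class_count)), a common weighting heuristic.
ratio = not_sexist / sexist
weights = {label: total / (2 * count)
           for label, count in (("not sexist", not_sexist), ("sexist", sexist))}

print(f"imbalance ratio = {ratio:.1f} : 1")          # roughly 3.1 : 1
print({k: round(v, 2) for k, v in weights.items()})  # sexist class upweighted
```

Such weights could then be passed, for example, to a cross-entropy loss so that errors on the minority (sexist) class cost proportionally more.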

There is extensive previous work tackling the detection of sexist content in social media using pre-trained models, either as a standalone task (Rodríguez-Sánchez et al., 2020; Abburi et al., 2021) or as a subset of toxic language classification (Park and Fung, 2017; Pitsilis et al., 2018).

Some instances of sexist language can be subtle and difficult to identify in the absence of proper context (Swim et al., 2004), especially when working with short texts such as tweets or isolated sentences that are part of a larger conversation. Recent attempts to improve generalization and robustness in these cases usually involve ensembles (Davies et al., 2021), adversarial training (Samory et al., 2020) or data augmentation (Butt et al., 2021). However, these techniques are not foolproof, and in addition to the extra training cost they can introduce unintended bias (Sen et al., 2022).


For these reasons, in this post we approach the EDOS binary classification task by automatically finetuning and ensembling sets of different large pre-trained models. The Fast Gradient Method (FGM), an intuitive backpropagation-based technique that perturbs input embeddings in the direction of the loss gradient (Miyato et al., 2016), building on early work on adversarial examples (Szegedy et al., 2013), is used during training in order to improve model robustness.
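The idea behind FGM can be sketched in a few lines of PyTorch: after the backward pass on clean inputs, add an L2-normalised gradient perturbation to the embedding weights, run a second forward/backward pass to accumulate adversarial gradients, and then restore the original weights before the optimizer step. The helper below is a minimal illustration; the class name, `epsilon` value and the `emb_name` substring used to locate the embedding parameters are assumptions, not details of the original system.

```python
import torch

class FGM:
    """Fast Gradient Method: perturb embedding weights along the gradient."""
    def __init__(self, model, epsilon=1.0, emb_name="embedding"):
        self.model = model
        self.epsilon = epsilon
        self.emb_name = emb_name  # substring matched against parameter names
        self.backup = {}

    def attack(self):
        # Save the clean embedding weights, then step epsilon along the
        # (L2-normalised) gradient direction.
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        # Undo the perturbation before the optimizer step.
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Inside a training loop the extra pass would look like this
# (model, loss_fn, optimizer, batch and labels are assumed to exist):
#   loss = loss_fn(model(batch), labels)
#   loss.backward()                            # gradients on clean inputs
#   fgm.attack()                               # perturb embeddings
#   loss_fn(model(batch), labels).backward()   # accumulate adversarial grads
#   fgm.restore()                              # restore clean embeddings
#   optimizer.step(); optimizer.zero_grad()
```

Because only one extra forward/backward pass is added per batch, this matches the moderate (~25%) training-time overhead reported above, rather than the larger cost of generating augmented text.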

The submission produced by the proposed system obtained an F1 score of 85.6% and ranked 15th out of 84 competing systems. The main takeaway is that adversarial training can be a viable strategy under domain (text size) or cost (training time) constraints, compared with more resource-intensive techniques such as text augmentation (Mosquera, 2022b).
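The ensembling step mentioned above can be as simple as soft voting: averaging the per-class probabilities produced by each finetuned model and taking the argmax. A minimal sketch, where the function name and the use of plain probability averaging are illustrative assumptions rather than the system's exact combination rule:

```python
import numpy as np

def soft_vote(prob_matrices):
    """Average per-model class probabilities and return the argmax label.

    prob_matrices: list of (n_samples, n_classes) arrays, one per model.
    """
    return np.mean(np.stack(prob_matrices), axis=0).argmax(axis=1)

# Two toy models scoring two posts on (not sexist, sexist):
model_a = np.array([[0.9, 0.1], [0.4, 0.6]])
model_b = np.array([[0.6, 0.4], [0.2, 0.8]])
labels = soft_vote([model_a, model_b])  # -> [0, 1]
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one, which tends to help on borderline short texts.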




Harika Abburi, Shradha Sehgal, Himanshu Maheshwari, and Vasudeva Varma. 2021. Knowledge-based neural framework for sexism detection and classification. In IberLEF@ SEPLN, pages 402–414.

Sabur Butt, Noman Ashraf, Grigori Sidorov, and Alexander Gelbukh. 2021. Sexism identification using BERT and data augmentation - EXIST2021. CEUR Workshop Proceedings, 2943:381–389.

Lily Davies, Marta Baldracchi, Carlo Alessandro Borella, and Konstantinos Perifanos. 2021. Transformer ensembles for sexism detection.

Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, and Paul Röttger. 2023. SemEval-2023 Task 10: Explainable Detection of Online Sexism. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). Association for Computational Linguistics.

Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification.

Alejandro Mosquera. 2022a. Alejandro Mosquera at PoliticEs 2022: Towards robust Spanish author profiling and lessons learned from adversarial attacks. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2022), A Coruña, Spain, September 20, 2022, volume 3202 of CEUR Workshop Proceedings. CEUR-WS.org.

Alejandro Mosquera. 2022b. Amsqr at SemEval-2022 task 4: Towards AutoNLP via meta-learning and adversarial data augmentation for PCL detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 485–489, Seattle, United States. Association for Computational Linguistics.

Alejandro Mosquera and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Text, Speech and Dialogue, pages 535–542, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ji Ho Park and Pascale Fung. 2017. One-step and two-step classification for abusive language detection on Twitter.

Georgios K. Pitsilis, Heri Ramampiaro, and Helge Langseth. 2018. Effective hate-speech detection in Twitter data using recurrent neural networks. Applied Intelligence, 48(12):4730–4742.

Francisco Rodríguez-Sánchez, Jorge Carrillo-de Albornoz, and Laura Plaza. 2020. Automatic classification of sexism in social networks: An empirical study on Twitter data. IEEE Access, 8:219563–219576.

Mattia Samory, Indira Sen, Julian Kohne, Fabian Floeck, and Claudia Wagner. 2020. “call me sexist, but…”: Revisiting sexism detection using psychological scales and adversarial samples.

Indira Sen, Mattia Samory, Claudia Wagner, and Isabelle Augenstein. 2022. Counterfactually augmented data and unintended bias: The case of sexism and hate speech detection.

Janet K. Swim, Robyn Mallett, and Charles Stangor. 2004. Understanding subtle sexism: Detection and use of sexist language. Sex Roles, 51:117–128.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks.

Bertie Vidgen and Leon Derczynski. 2021. Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLOS ONE, 15(12):1–32.



Alejandro Mosquera

Kaggle Grandmaster. Researcher in AI, Cyber Security, Machine Learning, NLP. Opinions are my own. www.amsqr.com www.alejandromosquera.net