<?xml version="1.0" encoding="UTF-8"?><xml><records><record><source-app name="Biblio" version="7.x">Drupal-Biblio</source-app><ref-type>10</ref-type><contributors><authors><author><style face="normal" font="default" size="100%">Joan Serrà</style></author><author><style face="normal" font="default" size="100%">Ilias Leontiadis</style></author><author><style face="normal" font="default" size="100%">Dimitris Spathis</style></author><author><style face="normal" font="default" size="100%">Gianluca Stringhini</style></author><author><style face="normal" font="default" size="100%">Jeremy Blackburn</style></author><author><style face="normal" font="default" size="100%">Athena Vakali</style></author></authors></contributors><titles><title><style face="normal" font="default" size="100%">Class-based Prediction Errors to Categorize Text with Out-of-vocabulary Words</style></title><tertiary-title><style face="normal" font="default" size="100%">ALW1'17</style></tertiary-title></titles><dates><year><style  face="normal" font="default" size="100%">2017</style></year></dates><pub-location><style face="normal" font="default" size="100%">Vancouver, Canada</style></pub-location><language><style face="normal" font="default" size="100%">eng</style></language><abstract><style face="normal" font="default" size="100%">&lt;p&gt;Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This presents important difficulties in highly dynamic or quickly-interacting environments, where the appearance of new words and/or varied misspellings is the norm. A paradigmatic example of this situation is abusive online behavior, with social networks and media platforms struggling to effectively combat uncommon or non-blacklisted hate words. To better deal with these issues in those fast-paced environments, we propose using the error signal of class-based language models as input to text classification algorithms. 
In particular, we train a next-character prediction model for any given class, and then exploit the error of such class-based models to inform a neural network classifier. This way, we shift from the ability to describe seen documents to the ability to predict unseen content. Preliminary studies using out-of-vocabulary splits from abusive tweet data show promising results, outperforming competitive text categorization strategies by 4–11%.&lt;/p&gt;</style></abstract></record></records></xml>