

Figure: Size of recent natural language models, in millions of parameters. Source

But what if a company doesn’t have the resources needed to train such large behemoths? Well, thanks to recent advances in transfer learning (a technique already well established in Computer Vision that has only recently found applications in NLP), companies can more easily achieve state-of-the-art performance by simply adapting pre-trained models to their own natural language tasks.

In this article, I would like to share a practical example of how to do just that using TensorFlow 2.0 and the excellent Hugging Face Transformers library, by walking you through how to fine-tune DistilBERT for sequence classification tasks on your own unique datasets.

And yes, I could have used the Hugging Face API to select a more powerful model such as BERT, RoBERTa, ELECTRA, MPNET, or ALBERT as my starting point. But I chose DistilBERT for this project due to its lighter memory footprint and faster inference speed. Compared to its older cousin, DistilBERT’s 66 million parameters make it 40% smaller and 60% faster than BERT-base, all while retaining more than 95% of BERT’s performance.² This makes DistilBERT an ideal candidate for businesses looking to scale their models in production, even up to more than 1 billion daily requests! And as we will see, DistilBERT can perform quite admirably with the proper fine-tuning.
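To make the end goal concrete before we dive in, here is a minimal sketch of what fine-tuning DistilBERT for binary sequence classification looks like with TensorFlow 2.0 and Hugging Face Transformers. The texts, labels, and hyperparameters below are placeholders of my own (the real pipeline is built up over the rest of the article), and exact APIs may differ slightly across transformers versions.

```python
# Minimal sketch (placeholder data, not the article's full pipeline):
# fine-tuning DistilBERT for binary sequence classification with
# TensorFlow 2.0 and Hugging Face Transformers.
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

train_texts = ["have a wonderful day", "you are a complete idiot"]  # toy examples
train_labels = [0, 1]                                               # 0 = non-toxic, 1 = toxic

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(train_texts, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(encodings), train_labels)
).batch(2)

model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_dataset, epochs=2)
```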

With that out of the way, let’s get started!

2.0) The Data

For this project, I will be classifying whether a comment is toxic or non-toxic using personally modified versions of the Jigsaw Toxic Comment dataset found on Kaggle (I converted the dataset from a multi-label classification problem to a binary classification problem). To deal with the dataset’s heavy class imbalance, we will implement a combination of both undersampling and oversampling to balance out our class distribution. (Note: Make sure to split your data beforehand and only oversample the training set to ensure your evaluation results remain as unbiased as possible!)
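As a rough illustration of that preprocessing (a sketch of my own, assuming the standard Kaggle train.csv with its six label columns and a comment_text column; it is not necessarily the exact code used to build the modified dataset), collapsing the labels into a single binary target and splitting before any resampling could look like this:

```python
# Sketch: collapse the Jigsaw multi-label columns into one binary "label",
# then split BEFORE any resampling so the held-out data stays untouched.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # the Kaggle Jigsaw training file
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# A comment counts as toxic if any of the original six labels is set.
df["label"] = (df[label_cols].sum(axis=1) > 0).astype(int)

train_df, test_df = train_test_split(
    df[["comment_text", "label"]],
    test_size=0.2,
    stratify=df["label"],  # keep the class ratio comparable across splits
    random_state=42,
)
```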

2.1) The ‘Unbalanced’ Dataset

It is important to note, however, that a fine balance must be struck when undersampling the majority class. If we undersample too much, we risk hurting model performance by losing valuable training data. But if we undersample too little (or not at all), the model’s predictions may be biased towards the majority class and fail to predict the minority class. Keeping this in mind, I attempted to find the right balance by undersampling the modified dataset until toxic comments made up ~20% of all training data. This dataset will henceforth be referred to as the unbalanced dataset, and it is the dataset on which I obtained the best empirical results for this specific problem.
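A sketch of that undersampling step (my own illustration of the ~20% target, applied only to the training split created above) could look like this:

```python
# Sketch: undersample the non-toxic majority on the TRAINING split only,
# so that toxic comments make up roughly 20% of the training data.
import pandas as pd

def undersample_majority(train_df: pd.DataFrame, minority_frac: float = 0.20,
                         seed: int = 42) -> pd.DataFrame:
    toxic = train_df[train_df["label"] == 1]
    non_toxic = train_df[train_df["label"] == 0]

    # If the minority should be `minority_frac` of the result, we need
    # len(toxic) * (1 - minority_frac) / minority_frac majority examples.
    n_keep = int(len(toxic) * (1 - minority_frac) / minority_frac)
    non_toxic = non_toxic.sample(n=min(n_keep, len(non_toxic)), random_state=seed)

    # Recombine and shuffle.
    return pd.concat([toxic, non_toxic]).sample(frac=1.0, random_state=seed)

train_unbalanced = undersample_majority(train_df)  # ~20% toxic
```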
Figure: Average performance gain over five text classification tasks for different training set sizes (N). The α parameter roughly means “percent of words in a sentence changed by each augmentation.” SR: Synonym Replacement, RI: Random Insertion, RS: Random Swap, RD: Random Deletion.
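For reference, the Synonym Replacement (SR) operation from that caption can be sketched as follows. This is a simplified, WordNet-based illustration of my own, not the implementation behind the figure; real augmenters typically also skip stop words and apply the other three operations.

```python
# Simplified sketch of the Synonym Replacement (SR) operation:
# swap roughly alpha * len(sentence) words for a WordNet synonym.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time download of the synonym database

def synonym_replacement(sentence: str, alpha: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = sentence.split()
    n_to_replace = max(1, int(alpha * len(words)))

    positions = list(range(len(words)))
    rng.shuffle(positions)

    replaced = 0
    for idx in positions:
        # Collect WordNet synonyms that differ from the original word.
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(words[idx])
            for lemma in syn.lemmas()
            if lemma.name().lower() != words[idx].lower()
        }
        if synonyms:
            words[idx] = rng.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n_to_replace:
            break
    return " ".join(words)

print(synonym_replacement("this comment is incredibly rude and hateful", alpha=0.2))
```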
