Building Text Classifiers: A New Approach with Less Data
June 20, 2024
TL;DR: In this article, we demonstrate a no-code approach to building a text classifier that not only outperforms Large Language Model (LLM) APIs from providers such as OpenAI and Cohere, but also does so with just a handful of labeled examples. If you're eager to see the results, feel free to jump straight to the experimental results.
Note: This article is designed for a general audience, especially for those who may not have a technical background. We aim to make the content accessible and easy to understand for everyone, regardless of their expertise in the field.
Text classification, a technique in machine learning, is a powerful tool that assigns predefined categories to open-ended text. It's a bit like sorting mail into different boxes based on the type of mail. You could have a box for bills, another for personal letters, and another for promotional materials. Similarly, text classification sorts text into different categories, making it easier to manage and analyze large amounts of text data. This technique is used in a variety of contexts, from classifying tweets and headlines to organizing customer reviews and news articles. Some of the most common applications include sentiment analysis, topic labeling, language detection, and intent detection.
But here's the real game-changer: we're going to build a text classifier that not only outperforms closed APIs like Cohere and OpenAI, but also achieves this superior performance with just a few labeled examples. Yes, you read that right. We're going to do all this without writing a single line of code. This approach is a testament to the power of leveraging small amounts of labeled data to achieve big results.
The Power of Text Classification
Text classification is a powerful tool that can transform the way we handle and analyze large amounts of text data. It's like having a Swiss Army knife for data management, with a multitude of uses that can be applied to a wide range of tasks.
One of the most compelling applications of text classification is in the realm of customer service. Imagine a customer writes in asking about refunds. With text classification, you can automatically assign the ticket to the teammate who handles refunds, ensuring a quick and efficient response. It's like having a virtual assistant that never sleeps.
But the power of text classification doesn't stop there. With the advent of few-shot learning, we can now train a model with just a handful of labeled examples and yet achieve performance comparable to models trained on a full dataset. This is a game-changer, especially in scenarios where labeled data is scarce or expensive to obtain.
In essence, text classification, when combined with few-shot learning, can revolutionize the way we handle text data, making it a cost-effective and efficient solution for a wide range of applications. Whether it's improving customer service, guiding business strategies, enhancing security, or boosting research productivity, the potential of this approach is immense.
Building a Better Text Classifier
Building a quality machine learning model for text classification can be a bit like assembling a jigsaw puzzle. You need to define the tags you will use, gather data for training the classifier, and tag your samples. Once you have defined your tags, the next step is to obtain text data. These are the texts that you want to use as training samples and that are representative of future texts that you would want to classify automatically with your model.
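For readers who like to see things concretely, the three ingredients above (tags, text samples, and tagged samples) can be pictured as a very small data structure. The sketch below is only an illustration: the tag names reuse the mail-sorting analogy from earlier, and the sample texts and helper names (validate, label_counts) are made up for this example rather than taken from any real dataset or product.

```python
from collections import Counter

# Hypothetical tag set, borrowed from the mail-sorting analogy.
TAGS = {"bills", "personal", "promotional"}

# Hypothetical tagged samples: each entry is (text, tag).
samples = [
    ("Your electricity invoice for May is attached.", "bills"),
    ("Happy birthday! Hope to see you soon.", "personal"),
    ("50% off all shoes this weekend only.", "promotional"),
    ("Payment due: water bill for June.", "bills"),
]


def validate(samples, tags):
    """Return the samples whose tag is not one of the defined tags."""
    return [(text, tag) for text, tag in samples if tag not in tags]


def label_counts(samples):
    """Count how many training samples each tag has."""
    return Counter(tag for _, tag in samples)


bad = validate(samples, TAGS)
counts = label_counts(samples)
print("Samples with unknown tags:", bad)
print("Samples per tag:", dict(counts))
```

Checking the per-tag counts early is a cheap way to notice a category that has too few examples to be representative.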
In recent years, there has been a growing interest in Generative AI and Large Language Models (LLMs) like ChatGPT. These models have the ability to generate human-like text and can be fine-tuned for specific tasks, such as text classification. One of the key advantages of LLMs is that they can perform well with a limited amount of labeled data, thanks to their ability to leverage pre-training on large amounts of unlabeled data. This makes them particularly useful in scenarios where labeled data is scarce or expensive to obtain.
However, while LLMs have their benefits, they are not without their drawbacks. They can be computationally expensive and time-consuming to train, and their performance can sometimes be unpredictable due to their black-box nature.
This is where few-shot learning comes into play. Few-shot learning is a technique for training a model with just a few labeled examples per category while aiming for performance comparable to a model trained on a full dataset. This approach is efficient and versatile, and it can significantly reduce data collection effort and computational costs.
So, whether you're dealing with a mountain of unlabeled data or just a molehill, few-shot learning can help you build a powerful text classifier that's ready to tackle your text classification tasks. In comparison to LLMs, these methods offer a more efficient and cost-effective solution for text classification, without compromising on performance.
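To make the idea of learning from a few examples more tangible, here is a minimal sketch of one very simple few-shot classifier: it averages the word counts of the handful of examples in each category and assigns new text to the closest category. This is not Mazaal AI's method (this article does not describe the underlying algorithm), and practical few-shot systems typically rely on pretrained text embeddings rather than raw word counts. All function names and example sentences below are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(examples):
    """Build one summed word-count vector (a centroid) per category."""
    centroids = defaultdict(Counter)
    for text, label in examples:
        centroids[label].update(vectorize(text))
    return dict(centroids)


def predict(centroids, text):
    """Return the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Two made-up labeled examples per category.
few_shot_examples = [
    ("I want my money back for this order", "refund"),
    ("Please refund the charge on my card", "refund"),
    ("The app crashes when I open settings", "bug"),
    ("Login page shows an error message", "bug"),
]

model = train(few_shot_examples)
print(predict(model, "Can I get a refund for my last order?"))
print(predict(model, "The settings page crashes with an error"))
```

Because cosine similarity ignores vector length, summing the word counts per category gives the same ranking as averaging them, which keeps the sketch short.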
How to Build a Text Classifier on Mazaal
Step 1: Create Project
Click "Train" in the left sidebar, then "Create a Project".
Name your project and select your data type and your objective, in this case, Classification.
In the next section, upload your data by clicking "Browse" and selecting your dataset.
Once you've uploaded the dataset, define the labels (categories) that you want to sort your text into.
Then start labeling your text one by one. To speed up this process, you can use the number keys: press 1 for the first category, 2 for the second category, and so on.
Once labeling is finished, click on "Create Machine" and "Train" to start the training process.
After the training process, you should be able to test the model right away.
The Proof is in the Pudding: Experimental Results in English
Let's take a look at the performance of various models in English text classification. The models were trained on training sets of different sizes, and the results show the training time, inference time, cost, F1 score, and accuracy for each model. The results for 8, 16, and 32 labeled examples per class are reported here.
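For readers curious about how such an experiment is set up, the sketch below shows one way to draw a fixed number of labeled examples per class and to compute accuracy and F1. It rests on assumptions: the article does not say how the examples were sampled or which F1 averaging was used, so random sampling and macro-averaged F1 are illustrative choices here, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(examples, k, seed=0):
    """Pick up to k labeled examples for every class (assumed: random sampling)."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels, only to show the calculation.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 4))
```

Accuracy counts every prediction equally, while macro F1 gives each class the same weight, so the two can diverge when some classes are much rarer than others.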
We selected the following dataset for rapid experimentation:
Name: SentEval - https://paperswithcode.com/dataset/senteval
Result
Mazaal AI outperforms the other models on cost-efficiency, performance, scalability with limited data, and speed. It's 10 times more cost-efficient than ChatGPT, achieves a better F1 score and accuracy, and performs well even with smaller datasets. Additionally, Mazaal AI is significantly faster, taking approximately 12.8 times less time than ChatGPT when trained on 32 samples per class. This speed advantage becomes even more pronounced as the dataset size increases, widening the gap in both speed and cost efficiency.
Experimental Results in Low Resource Language
Before diving into the experimental results in low-resource languages, it's important to mention the #BenderRule. This rule encourages researchers to always name the languages they are working on and highlights the importance of studying low-resource languages to truly test NLP performance, in this case, text classification. By focusing on these languages, we can better understand the limitations and capabilities of NLP models and develop more robust and inclusive solutions. With this in mind, let's explore the results of our text classification experiments in a low-resource language.
Dataset
Name: 11-11 dataset
11-11.mn is a website designed for connecting Mongolian government agencies with the public. Anyone can issue a ticket for a complaint, criticism, or simply a request, then it will be forwarded to a specific government agency.
The dataset has 80,036 records from 2012-10-13 to 2018-11-12 (about 6 years), and it is probably the biggest labeled dataset that can be found in the Mongolian language. By contrast, Mongolian Wikipedia, which is where most LLMs get their source data, contains only around 3GB of content. This scarcity of Mongolian text is the main reason why we picked this language for our experiments.
When we ran the experiments on OpenAI ChatGPT and Cohere, we provided prompts in few-shot style so that the comparison is fairer, given that both models then receive data similar to what is provided at training. We included experiments with Cohere since they claim to work with more than 100 languages. Since there was a token limit on the input, we included only 8 examples per class with Cohere, while with ChatGPT we used gpt-3.5-16k, which could hold all the input examples in the prompt.
Below is the example prompt we used with OpenAI and Cohere:
Here, I am providing {num_class*example_per_class} examples of sentence and categories. Classify following sentence to one of the following 4 categories and result should be only the category without double quote not anything else: complaint, criticism, request, gratitude.
sentence: example_sentence_1 category: example_category_1
sentence: example_sentence_2 category: example_category_2
……
sentence: input_sentence_1 category:
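For the technically curious, the template above can be assembled programmatically. The sketch below is one possible way to do it: the wording of the instruction and the four categories come from the prompt shown above, while the function names (build_prompt, parse_category), the placeholder sentences, and the answer-cleaning step are assumptions made for illustration. It only builds the prompt text and does not call any API.

```python
CATEGORIES = ["complaint", "criticism", "request", "gratitude"]


def build_prompt(examples, input_sentence, categories=CATEGORIES):
    """Assemble a few-shot prompt from (sentence, category) pairs."""
    header = (
        f"Here, I am providing {len(examples)} examples of sentence and categories. "
        f"Classify following sentence to one of the following {len(categories)} "
        "categories and result should be only the category without double quote "
        f"not anything else: {', '.join(categories)}."
    )
    lines = [header]
    for sentence, category in examples:
        lines.append(f"sentence: {sentence} category: {category}")
    # The final line leaves the category blank for the model to fill in.
    lines.append(f"sentence: {input_sentence} category:")
    return "\n".join(lines)


def parse_category(response, categories=CATEGORIES):
    """Clean a model reply; return None if it is not a known category."""
    cleaned = response.strip().strip("\"'").strip().lower()
    return cleaned if cleaned in categories else None


# Placeholder examples, not taken from the 11-11 dataset.
examples = [
    ("example_sentence_1", "complaint"),
    ("example_sentence_2", "gratitude"),
]
prompt = build_prompt(examples, "input_sentence_1")
print(prompt)
print(parse_category(' "Request"\n'))
```

Rejecting any reply that is not one of the four categories keeps an off-script answer from being silently counted as a valid prediction.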
The results for text classification in a low resource language are shown below. The F1-score and accuracy for each class are reported for all models.
Based on the above table, here are the key advantages of Mazaal AI over other models in terms of cost-efficiency, performance, scalability with limited data, and speed:
1. Cost-efficiency: Mazaal AI is significantly more cost-effective than other models. For instance, it is approximately 690 times more cost-efficient than ChatGPT when trained on 32 samples per class.
2. Performance: Mazaal AI achieves higher F1 scores and accuracy rates across all class sizes. For example, it achieves an F1 score about 13 points higher and an accuracy 6% higher than ChatGPT when trained on 32 samples per class.
3. Scalability with limited data: Mazaal AI performs well even when trained on a smaller dataset. When trained on just 8 samples per class, Mazaal AI achieves about 1 point higher F1 score and 2% higher accuracy than ChatGPT.
4. Speed: Mazaal AI is significantly faster than other models. When considering both training and inference time, Mazaal AI takes approximately 16 times less time than ChatGPT when trained on 32 samples per class. This speed advantage becomes even more pronounced as the dataset size increases.
Wrapping Up
In the realm of text classification, the landscape is rapidly evolving, and the power of this tool is being harnessed in increasingly innovative ways. As we've seen, it's possible to build a text classifier that not only outperforms closed APIs like Cohere and OpenAI, but does so with just a handful of labeled examples. This is a testament to the power of leveraging small amounts of labeled data to achieve big results.
Our experimental results have demonstrated the competitive edge of Mazaal AI in both English and low-resource language text classification tasks. But the capabilities of Mazaal AI extend even further. With support for over 100 languages, Mazaal AI is not just a tool, but a global solution, ready to tackle text classification tasks in a multitude of languages and contexts.
So, whether you're dealing with a mountain of text data or just a molehill, remember: there's a better, more efficient, and more inclusive way to sort it all out. With Mazaal AI, you're not just classifying text, you're unlocking the potential of your data, and opening up a world of possibilities.
In the end, the power of text classification is not just in sorting and organizing data, but in the insights it can provide, the efficiencies it can create, and the doors it can open. And with Mazaal AI, that power is in your hands.
In the second part of this article, we'll take a deep dive into the actual algorithms and concepts.