How UniNER outperforms GPT-3.5-Turbo despite being 90% smaller
Because UniNER is comparatively small, anyone can fine-tune it at a very low cost.
On August 8, researchers from the University of Southern California and Microsoft released a paper titled UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition, introducing UniNER, a universal named entity recognition model that outperforms ChatGPT on all datasets tested.
Before we analyze how UniNER does this and how it can be helpful to us, let’s first understand the key terms in this article.
GLOSSARY
Named Entity Recognition
NER is a pre-processing step for most Natural Language Processing (NLP) problems, in which named entity information is extracted from a given text. An entity can be the name of a place, an object, a person, etc., and each entity has a type: for example, Los Angeles {entity} is a city {type}. NER detects all entities in a text and assigns each one its type.
NER has many applications: helping a chatbot understand the intent of a query, extracting specific data from documents, and surfacing key entities for search engines. Improving a model’s NER performance therefore improves its performance across many tasks.
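To make this concrete, here is a tiny NER illustration using spaCy’s off-the-shelf English pipeline (a generic NER tool, not UniNER):

```python
# A minimal NER demo with spaCy (generic pipeline, not UniNER).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Los Angeles hosted the ceremony where Tim Cook spoke for Apple.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Los Angeles -> GPE, Tim Cook -> PERSON, Apple -> ORG
```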
Knowledge Distillation
Knowledge distillation is a machine learning strategy for transferring knowledge from a large, trained teacher model to a small, untrained student model. The student can reach comparable performance with less training and far fewer parameters. Alpaca, Vicuna, and now UniNER are good examples of student models trained via knowledge distillation.
Instruction Tuning
Instruction tuning is a different approach: models are tuned on instructions rather than plain datasets. Each training example contains three parts:
Instruction: "Translate this sentence to Hindi"
Input: "Hi, How are you?"
output: “Namaste, Kaise ho aap?”
The model is trained to generate the desired output from the instruction–input combination. This has proven to be more generalizable than traditional fine-tuning.
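In practice, each (instruction, input, output) triple is flattened into a single prompt string before training. A minimal sketch using an Alpaca-style template; the exact template varies from model to model:

```python
# Alpaca-style prompt template (illustrative; exact templates vary by model).
TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

example = {
    "instruction": "Translate this sentence to Hindi.",
    "input": "Hi, how are you?",
    "output": "Namaste, kaise ho aap?",
}

print(TEMPLATE.format(**example))  # this string becomes one training sample
```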
Now, let’s dive deep into the paper. We already have many models that were instruction-tuned and/or distilled from a larger teacher model, but none of them perform anywhere close to their teacher. Student models trail the teacher by a significant margin, especially in targeted downstream applications. Models like Alpaca and Vicuna can imitate ChatGPT in a few cases, but their performance is not comparable for downstream tasks. This is understandable given the limited compute budgets for distillation, which make an accurate approximation across all possible applications infeasible.
The key innovation in UniNER is the use of mission-focused instruction tuning: training student models for one broad class of applications, in this case open-domain NER. Instead of relying on generic instructions, the authors devise a tuning recipe specifically targeted to excel at NER across diverse entity types and domains.
It is important to understand how UniNER’s training process differs from that of previous LLMs. Traditionally, in instruction tuning, the instructions are diversified so that the model can perform various tasks on a small dataset.
For example, suppose the small dataset is the Harry Potter series. Other instruction-tuned LLMs can then handle various tasks like summarizing, translating, paraphrasing, and chatting, but only on the Harry Potter books. Their performance drops significantly when they are tested on the same tasks over other datasets.
With UniNER, by contrast, the instruction stays the same across all 43 datasets ("recognize entities and classify them by type") while the input is diversified. Because NER is a broad, coarse-grained task with many downstream applications, the authors argue that a model trained to perform well on NER will serve all of those downstream applications.
So the input is diversified, but how are the inputs for instruction tuning generated across 43 datasets spanning 9 domains such as programming, medicine, and technology? Doing it manually would take months. The authors used ChatGPT (gpt-3.5-turbo-0301) to generate the annotations for them.
[Figure: the data-construction prompt given to ChatGPT. Source: Zhou, W., Zhang, S., Gu, Y., Chen, M., & Poon, H. (2023). UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. arXiv:2308.03279]
They created short passages from all the datasets and used the prompt above to have ChatGPT perform NER on them.
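As a rough sketch of that generation step, here is how you might query ChatGPT today. Note that the prompt below is paraphrased from the paper, not their verbatim wording, and the gpt-3.5-turbo-0301 snapshot they used has since been retired:

```python
# Paraphrase of the paper's data-construction step: ask ChatGPT to label
# passages with (entity, type) pairs. The prompt wording is an approximation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Given a passage, extract all entities and identify their entity types. "
    'Output a list of tuples: [("entity 1", "type of entity 1"), ...].\n\n'
    "Passage: {passage}"
)

passage = "The patient was prescribed metformin for type 2 diabetes in Boston."
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the paper used the now-retired 0301 snapshot
    messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
)
print(resp.choices[0].message.content)
```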
After cleaning, the generated dataset comprises 45,889 input-output pairs, encompassing 240,725 entities and 13,020 distinct entity types.
[Table: the most frequent entity types in the constructed dataset. Source: Zhou, W., Zhang, S., Gu, Y., Chen, M., & Poon, H. (2023). UniversalNER. arXiv:2308.03279]
The most frequent entity types are in the table above.
But there is a problem with this technique: even though the authors did not specify any entity types and let ChatGPT choose its own, a model tuned on this data becomes sensitive to the exact entity-type names and may fail when they are paraphrased. To test this, they changed the prompt to "extract entities and define their types using short sentences." This method generated a much more diverse set of 353,092 entity types.
Unlike classic distillation, where the student is trained with a modified loss function on the soft targets the teacher produces over the same dataset, UniNER distills knowledge from ChatGPT by training directly on data that ChatGPT generates.
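For contrast, the classic soft-target objective looks like this; a minimal PyTorch sketch of the loss UniNER does not use:

```python
import torch
import torch.nn.functional as F

def soft_target_distillation_loss(student_logits, teacher_logits, T=2.0):
    """Classic knowledge-distillation loss (soft targets): KL divergence
    between temperature-softened teacher and student distributions.
    UniNER skips this entirely and trains with a standard language-modeling
    loss on ChatGPT-generated (input, output) pairs."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor rescales gradients to match the hard-label loss magnitude.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
```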
In a zero-shot setting, remarkably, without any human-labeled data, UniNER surpasses ChatGPT's NER performance by 7-9 F1 points on average with 90% fewer parameters. That’s right: gpt-3.5-turbo is estimated to have around 135 billion parameters, whereas the LLaMA model used for UniNER has 13 billion.
UniNER also beats Vicuna, the previous state of the art among open-source zero-shot NER models, by over 30 F1 points.
[Figure: zero-shot F1 of UniNER, Vicuna, and ChatGPT across nine text domains. Source: Zhou, W., Zhang, S., Gu, Y., Chen, M., & Poon, H. (2023). UniversalNER. arXiv:2308.03279]
The figure above compares UniNER, Vicuna, and ChatGPT across 9 domains of text in a zero-shot setting; UniNER outperforms ChatGPT in every one of them.
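To get a feel for how the distilled model is queried in this zero-shot setting, here is a rough inference sketch with Hugging Face transformers. The Hub id Universal-NER/UniNER-7B-type and the conversation template are taken from the project release as I understand it; treat both as assumptions:

```python
# Rough zero-shot NER query against a UniNER checkpoint.
# Hub id and conversation template assumed from the project release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Universal-NER/UniNER-7B-type"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "A virtual assistant answers questions about named entities in a text.\n"
    "USER: Text: I had a fever, so I took paracetamol and rested in Mumbai.\n"
    "ASSISTANT: I've read this text.\n"
    "USER: What describes medicine in the text?\n"
    "ASSISTANT:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# expected: a JSON list such as ["paracetamol"]
```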
“Alright, Abhiram, we get it. UniNER is great. But how does this benefit us?“
This has useful implications for democratizing access to large language model capabilities. With targeted tuning recipes like UniNER’s, it may be feasible to create customizable LLM-based models for specific use cases without the prohibitively expensive training of full-sized models. Companies and developers could potentially tune compact models on their own proprietary data and tasks.
Because UniNER is open-access and comparatively small, you can fine-tune it for your own use case and data by spending less than $350 on compute.
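What might such a budget fine-tune look like? A minimal LoRA sketch with peft; since the official fine-tuning code is not out yet (see below), this is an assumption-laden stand-in, not the authors’ recipe:

```python
# NOT the authors' recipe: a generic LoRA fine-tuning sketch with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Universal-NER/UniNER-7B-type"  # assumed Hub id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a tiny fraction of weights train

# One supervised example in the same conversational format as above.
text = ("A virtual assistant answers questions about named entities in a text.\n"
        "USER: Text: Steve Jobs co-founded Apple in Cupertino.\n"
        "ASSISTANT: I've read this text.\n"
        "USER: What describes organization in the text?\n"
        'ASSISTANT: ["Apple"]')
batch = tok(text, return_tensors="pt").to(model.device)
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # from here, plug into an optimizer loop or the HF Trainer
```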
Now, the authors have said they will release all the code to the public. Right now, the model weights and inference code are open; the fine-tuning and evaluation code will be released soon.
By open-sourcing the training code and data, this work opens exciting avenues for creating accurate and affordable LLMs for real-world applications.
But there are two problems I see here.
First, because GPT-3.5 is closed-source, the authors could only evaluate it through its API. The underlying model might perform better than its API, and perhaps better than UniNER, but that is a big question that will remain unanswered.
Second, they have published results for only one evaluation metric. Though the F1 score is the apt metric for this kind of problem, seeing UniNER’s performance on other metrics would have been more insightful.
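For reference, F1 is the harmonic mean of precision and recall over the predicted entities:

```python
# F1 over entity predictions: tp = correct entities, fp = spurious, fn = missed.
def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1(tp=80, fp=10, fn=20))  # ~0.842
```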
Overall, UniNER demonstrates the promise of targeted distillation for making large language model prowess more accessible.