
Implemented projects spanning non-contextual word embeddings like Word2Vec, contextual word embeddings like ELMo, GPT, and BERT, and NLP tasks: sentence-level classification (sentiment analysis & toxic comment classification), token-level classification (POS tagging, NER tagging), and machine translation (MT)


khetansarvesh/NLP


$\color{cyan}{1.\ Representation\ Learning\ (pretraining) }$

We need to represent language mathematically, i.e. given a corpus, you need to convert it into numerical form. This mathematical representation is called an embedding, and the process is called representation learning. Why do this? Because computers understand only numbers, not text. We can do this in several ways:
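Whichever method you pick, the end product is the same: each token maps to an integer id and then to a vector. A toy sketch of that lookup, with random values standing in for the weights a real model (Word2Vec, BERT, etc.) would learn:

```python
import random

corpus = ["the cat sat", "the dog ran"]

# Build a vocabulary and assign each word an integer id.
vocab = sorted({w for sent in corpus for w in sent.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}

dim = 4
random.seed(0)
# Each row of this matrix is one word's embedding. In a trained model
# these values are learned; random numbers here just show the mechanics.
embedding = [[random.random() for _ in range(dim)] for _ in vocab]

def embed(sentence):
    # Text -> ids -> vectors: the "numerical form" of the corpus.
    return [embedding[word_to_id[w]] for w in sentence.split()]

vectors = embed("the cat")
```

Real pipelines differ mainly in how the embedding matrix is produced (count-based, Word2Vec, contextual transformers), not in this lookup step.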

Since we have so many methods to convert a corpus into a numerical representation, which one should we use? It really depends on the data you are trying to convert:

  • If it is social media data, character embeddings may work better, since noisy spelling and slang break word-level vocabularies.
  • If your language is Chinese, word embeddings may work really badly: Chinese has no spaces between words, so identifying word boundaries is itself a hard problem. Character embeddings work really well here, and so do subword embeddings.
  • Languages like French and Arabic do have spaces, but what English expresses as two words ("so said") may be a single word in these languages, so word embeddings might not work well with them either.
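The granularity trade-off above can be illustrated with a toy tokenizer sketch (pure Python; the whitespace splitter is a stand-in for a real word tokenizer):

```python
def word_tokens(text):
    # Word-level: split on whitespace. This fails for languages like
    # Chinese, which do not delimit words with spaces.
    return text.split()

def char_tokens(text):
    # Character-level: always well defined, but sequences get long.
    return list(text)

english = "so said"
chinese = "你好世界"  # "hello world", written without spaces

print(word_tokens(english))   # ['so', 'said']
print(word_tokens(chinese))   # ['你好世界'] -- one giant "word"
print(char_tokens(chinese))   # ['你', '好', '世', '界']
```

Subword tokenizers (BPE, WordPiece) sit between these two extremes, which is why they are the default for modern LLMs.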

Finally, before you start building your own LLM from scratch, I would recommend reading this amazing blog from HF, which helps you understand how to find the right architecture and how big, scalable models are built!

$\color{cyan}{2.\ Downstream\ NLP\ (posttraining) }$

$\color{red}{2.A]\ Supervised\ Fine\ Tuning\ (SFT) }$

  • Non Generative (Classification) Tasks - Natural Language Understanding (NLU) Tasks

    • Sentence level classification Tasks
    • Token / word level classification tasks (also called sequence labeling / tagging tasks): a problem in which you classify each word/token in the corpus as something. It has been observed that a masked-language-model (MLM) based base model gives better results than a next-subword-prediction based base model on NLU tasks. There are numerous word-level classification tasks, some of which are
  • Generative Tasks / Natural Language Generation (NLG) Tasks / Text2Text Tasks / Sequence2Sequence (Seq2Seq or S2S) Tasks: a family of tasks in which we generate new sentences from input sentences. It has been observed that a next-subword-prediction based foundation model gives better results than an MLM-based foundation model on NLG tasks. These include tasks like

  • Instruction Tuning Task
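The MLM vs. next-subword-prediction distinction above comes down to how (input, target) training pairs are built from the same token sequence. A toy illustration (not any specific library's API):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def causal_lm_pairs(toks):
    # Next-token objective (GPT-style): every prefix predicts the
    # following token, so the model only sees leftward context.
    return [(toks[:i], toks[i]) for i in range(1, len(toks))]

def mlm_pair(toks, mask_index):
    # Masked-LM objective (BERT-style): hide one token and predict it
    # from the full bidirectional context -- handy for tagging tasks.
    masked = toks[:mask_index] + ["[MASK]"] + toks[mask_index + 1:]
    return masked, toks[mask_index]

print(causal_lm_pairs(tokens)[1])  # (['the', 'cat'], 'sat')
print(mlm_pair(tokens, 2))  # (['the', 'cat', '[MASK]', 'on', 'the', 'mat'], 'sat')
```

Bidirectional context is why MLM-based models tend to do better on token-level NLU, while left-to-right models are the natural fit for generation.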

$\color{red}{2.B]\ Reinforcement\ Learning\ Based\ Fine\ Tuning}$

Above we saw how to fine-tune a foundation LLM for different downstream tasks using SFT; for all those tasks we can also fine-tune a foundation model using RL. There are two ways to use RL for fine-tuning:

  • Manual Reward Function Based RL : Here we will see how to fine-tune using RL if you 'can design' a good reward function for your downstream task.
  • (preferred) Automatic Reward Function Based RL : Here we will see how to fine-tune using RL if you 'cannot design' a good reward function for your downstream task, with the help of a preference dataset. There are multiple methods to do this :
    • Reinforcement Learning from Human Feedback (RLHF) : trains a separate reward model
    • (preferred) Direct Preference Optimization (DPO) : uses the base LLM itself as the reward model
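DPO's "the LLM is its own reward model" idea is visible in its loss: only log-probabilities from the policy and a frozen reference model are needed. A minimal sketch of the per-example loss from the DPO paper:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # All inputs are log-probabilities of whole responses.
    # The policy is rewarded for raising its chosen-vs-rejected log-ratio
    # relative to the frozen reference -- no separate reward model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # loss = -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more strongly than the reference does,
# so the loss is low; flipping the preference raises it.
good = dpo_loss(pi_chosen=-1.0, pi_rejected=-5.0, ref_chosen=-2.0, ref_rejected=-2.0)
bad = dpo_loss(pi_chosen=-5.0, pi_rejected=-1.0, ref_chosen=-2.0, ref_rejected=-2.0)
```

In practice you would batch this over a preference dataset and backpropagate through the policy's log-probs; `beta` controls how far the policy may drift from the reference.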

Important

It has been shown that it is better to first fine-tune an LLM on a task using SFT and then fine-tune on the same task using RL; this gives better outcomes. This notebook from Unsloth follows this recipe to convert Qwen3 from a non-reasoning model into a reasoning model.

$\color{cyan}{3.\ Non\ Agentic\ LLM\ Systems\ --LLMs\ without\ tool\ access}$

  • Since the LLM doesn't have tool access, it is not capable of fetching data 'on its own' to answer the query.
  • Hence, to answer queries, it has two options:
    • A. Use its own internal knowledge from training (this can lead to wrong answers, because the data it was trained on may now be outdated).
    • B. Use the context that the 'user provides' to the LLM. If this context is huge, then how your LLM goes through it becomes really important; there are many ways, more details available here.
  • Once you have the right context, using efficient prompting techniques also becomes really important.
    • Use dynamic few shots
    • Use dynamic prompts
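"Dynamic few shots" can be sketched as retrieving the stored examples most similar to the incoming query instead of hard-coding one fixed shot list. Token-overlap (Jaccard) similarity below is a cheap stand-in for embedding similarity; all names are illustrative:

```python
def overlap(a, b):
    # Jaccard similarity over lowercase word sets -- a stand-in for
    # cosine similarity between sentence embeddings.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def dynamic_few_shots(query, example_bank, k=2):
    # Pick the k labeled examples closest to the query; these get
    # spliced into the prompt ahead of the query itself.
    ranked = sorted(example_bank,
                    key=lambda ex: overlap(query, ex["text"]),
                    reverse=True)
    return ranked[:k]

bank = [
    {"text": "the movie was great", "label": "positive"},
    {"text": "terrible customer service", "label": "negative"},
    {"text": "the movie was boring", "label": "negative"},
]
shots = dynamic_few_shots("was the movie great or boring", bank, k=2)
```

The same retrieve-then-fill pattern underlies dynamic prompts: the template stays fixed while the retrieved pieces change per query.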

$\color{cyan}{4.\ LLM\ Agents\ --LLMs\ with\ tool\ access}$

Now that we have given our LLM access to tools, it can use these tools to get real-time data. Tools can be classified into the following broad categories:

  • API tools (Structured Data) : simple wrappers over existing APIs, e.g. Twitter tools that return data in JSON format.
  • Web Tools (Unstructured Data) : plenty of websites on the internet do not provide an API to access their information, so to get information from such a website we need to go to its URL and extract the data somehow.
    • Web Search Tool (for Static Web Pages) : getting data from such pages is simple; it is like downloading the page and using it.
    • Web Agent Tool (for Dynamic Web Pages) : getting data from such pages is difficult, and the download-and-use method will miss information (to understand this better, read this blog). Hence people built web agents that can navigate the web just like humans do.
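Whatever the category, tool use reduces to the same mechanical pattern: the LLM emits a tool name plus arguments, and a dispatcher runs the matching function. A minimal sketch with stub tools (real frameworks add argument schemas, validation, and error handling):

```python
import json

# Stub tools -- stand-ins for a real API wrapper and a web fetcher.
def twitter_api(query):
    return {"source": "api", "results": [f"tweet about {query}"]}

def web_search(query):
    return {"source": "web", "results": [f"page about {query}"]}

TOOLS = {"twitter_api": twitter_api, "web_search": web_search}

def dispatch(tool_call_json):
    # The LLM's tool call arrives as JSON: {"tool": ..., "args": {...}};
    # we look up the function by name and invoke it with the args.
    call = json.loads(tool_call_json)
    return TOOLS[call["tool"]](**call["args"])

result = dispatch('{"tool": "web_search", "args": {"query": "llm agents"}}')
```

The frameworks listed below mostly differ in how they describe `TOOLS` to the model and how they loop the result back into the conversation.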

There are many frameworks you can use to build these LLM systems; a few good ones are DSPy || AutoGen || LangGraph || CrewAI

$\color{red}{4.A]\ Single\ Agentic\ System}$

A single-agent system does not mean that you make one single LLM call; it just means that you have one single agent, but that agent can be called multiple times.
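That "one agent, many calls" loop can be sketched as a simple ReAct-style while loop. The `fake_llm` below is a stub standing in for a real model call, and `lookup_population` is a hypothetical tool:

```python
def fake_llm(history):
    # Stub model: asks for a tool on the first call, then answers.
    if not any(msg.startswith("observation:") for msg in history):
        return "tool: lookup_population paris"
    return "final: Paris has about 2.1 million residents"

def lookup_population(city):
    return {"paris": "2.1 million"}[city]

TOOLS = {"lookup_population": lookup_population}

def run_agent(question, max_steps=5):
    # One agent, called repeatedly: each tool observation is appended to
    # the history until the model emits a final answer.
    history = [f"question: {question}"]
    for _ in range(max_steps):
        action = fake_llm(history)
        if action.startswith("final:"):
            return action.removeprefix("final:").strip()
        _, name, arg = action.split()
        history.append(f"observation: {TOOLS[name](arg)}")
    raise RuntimeError("agent did not finish")

answer = run_agent("How many people live in Paris?")
```

Here the single agent made two LLM calls (one tool call, one answer), which is exactly the distinction the paragraph above draws.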

$\color{red}{4.B]\ Multi\ Agentic\ System}$

When should you use a multi-agent system? When, even after applying all the context-optimization techniques we saw for the single-agent system, you still face the context-rot issue. In such situations, the idea is to use multiple agents, each with its own context and its own specific tools.
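A minimal sketch of that split: a router sends each query to a specialist agent that keeps its own context and tool set. All names here are hypothetical stubs; real systems often use an LLM as the router:

```python
class Agent:
    def __init__(self, name, tools):
        self.name = name
        self.tools = tools   # only this agent's tools
        self.context = []    # each agent accumulates its own context

    def handle(self, query):
        self.context.append(query)
        return f"{self.name} answered using {sorted(self.tools)}"

AGENTS = {
    "code": Agent("code_agent", {"run_tests", "read_repo"}),
    "research": Agent("research_agent", {"web_search"}),
}

def route(query):
    # Crude keyword router; the point is that each specialist's context
    # stays small because it only ever sees its own queries.
    key = "code" if "bug" in query or "test" in query else "research"
    return AGENTS[key].handle(query)

reply = route("fix the failing test in module X")
```

Because the research agent never saw the coding query, its context stays empty -- the isolation that mitigates context rot.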
