From fd29e96c972cf92b8c29a2fee17ef82aa84d43dd Mon Sep 17 00:00:00 2001 From: sunishbharat Date: Mon, 24 Nov 2025 22:31:00 +0100 Subject: [PATCH 1/4] Update README.md Update readme --- README.md | 138 +++--------------------------------------------------- 1 file changed, 6 insertions(+), 132 deletions(-) diff --git a/README.md b/README.md index 23908a3..e451564 100644 --- a/README.md +++ b/README.md @@ -1,139 +1,13 @@ -# LLMX - An API for Chat Fine-Tuned Language Models +# LLMX with Ollama extension +This repository is a fork of llmx with added support for running Ollama models locally. +It extends llmx by integrating locally hosted Ollama models and their execution features. +You can install this fork directly from the GitHub repository using pip. -[![PyPI version](https://badge.fury.io/py/llmx.svg)](https://badge.fury.io/py/llmx) +Use this version if you want seamless integration of Ollama models within the llmx workflow. +Contributions and feedback are welcome to further improve Ollama compatibility. -A simple python package that provides a unified interface to several LLM providers of chat fine-tuned models [OpenAI, AzureOpenAI, PaLM, Cohere and local HuggingFace Models]. -> **Note** -> llmx wraps multiple api providers and its interface _may_ change as the providers as well as the general field of LLMs evolve. -There is nothing particularly special about this library, but some of the requirements I needed when I started building this (that other libraries did not have): - -- **Unified Model Interface**: Single interface to create LLM text generators with support for **multiple LLM providers**. - -```python -from llmx import llm - -gen = llm(provider="openai") # support azureopenai models too. -gen = llm(provider="palm") # or google -gen = llm(provider="cohere") # or palm -gen = llm(provider="hf", model="HuggingFaceH4/zephyr-7b-beta", device_map="auto") # run huggingface model locally -``` - -- **Unified Messaging Interface**. Standardizes on the OpenAI ChatML message format and is designed for _chat finetuned_ models. For example, the standard prompt sent a model is formatted as an array of objects, where each object has a role (`system`, `user`, or `assistant`) and content (see below). A single request is list of only one message (e.g., write code to plot a cosine wave signal). A conversation is a list of messages e.g. write code for x, update the axis to y, etc. Same format for all models. - -```python -messages = [ - {"role": "user", "content": "You are a helpful assistant that can explain concepts clearly to a 6 year old child."}, - {"role": "user", "content": "What is gravity?"} -] -``` - -- **Good Utils (e.g., Caching etc)**: E.g. being able to use caching for faster responses. General policy is that cache is used if config (including messages) is the same. If you want to force a new response, set `use_cache=False` in the `generate` call. - -```python -response = gen.generate(messages=messages, config=TextGeneratorConfig(n=1, use_cache=True)) -``` - -Output looks like - -```bash - -TextGenerationResponse( - text=[Message(role='assistant', content="Gravity is like a magical force that pulls things towards each other. It's what keeps us on the ground and stops us from floating away into space. ... 
")], - config=TextGenerationConfig(n=1, temperature=0.1, max_tokens=8147, top_p=1.0, top_k=50, frequency_penalty=0.0, presence_penalty=0.0, provider='openai', model='gpt-4', stop=None), - logprobs=[], usage={'prompt_tokens': 34, 'completion_tokens': 69, 'total_tokens': 103}) - -``` - -Are there other libraries that do things like this really well? Yes! I'd recommend looking at [guidance](https://github.com/microsoft/guidance) which does a lot more. Interested in optimized inference? Try somthing like [vllm](https://github.com/vllm-project/vllm). - -## Installation - -Install from pypi. Please use **python3.10** or higher. - -```bash -pip install llmx -``` - -Install in development mode - -```bash -git clone -cd llmx -pip install -e . -``` - -Note that you may want to use the latest version of pip to install this package. -`python3 -m pip install --upgrade pip` - -## Usage - -Set your api keys first for each service. - -```bash -# for openai and cohere -export OPENAI_API_KEY= -export COHERE_API_KEY= - -# for PALM via MakerSuite -export PALM_API_KEY= - -# for PaLM (Vertex AI), setup a gcp project, and get a service account key file -export PALM_SERVICE_ACCOUNT_KEY_FILE= -export PALM_PROJECT_ID= -export PALM_PROJECT_LOCATION= -``` - -You can also set the default provider and list of supported providers via a config file. Use the yaml format in this [sample `config.default.yml` file](llmx/configs/config.default.yml) and set the `LLMX_CONFIG_PATH` to the path of the config file. - -```python -from llmx import llm -from llmx.datamodel import TextGenerationConfig - -messages = [ - {"role": "system", "content": "You are a helpful assistant that can explain concepts clearly to a 6 year old child."}, - {"role": "user", "content": "What is gravity?"} -] - -openai_gen = llm(provider="openai") -openai_config = TextGenerationConfig(model="gpt-4", max_tokens=50) -openai_response = openai_gen.generate(messages, config=openai_config, use_cache=True) -print(openai_response.text[0].content) - -``` - -See the [tutorial](/notebooks/tutorial.ipynb) for more examples. - -## A Note on Using Local HuggingFace Models - -While llmx can use the huggingface transformers library to run inference with local models, you might get more mileage from using a well-optimized server endpoint like [vllm](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html#openai-compatible-server), or FastChat. The general idea is that these tools let you provide an openai-compatible endpoint but also implement optimizations such as dynamic batching, quantization etc to improve throughput. The general steps are: - -- install vllm, setup endpoint e.g., on port `8000` -- use openai as your provider to access that endpoint. - -```python -from llmx import llm -hfgen_gen = llm( - provider="openai", - api_base="http://localhost:8000", - api_key="EMPTY, -) -... -``` - -## Current Work - -- Supported models - - [x] OpenAI - - [x] PaLM ([MakerSuite](https://developers.generativeai.google/api/rest/generativelanguage), [Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/models)) - - [x] Cohere - - [x] HuggingFace (local) - -## Caveats - -- **Prompting**. llmx makes some assumptions around how prompts are constructed e.g., how the chat message interface is assembled into a prompt for each model type. If your application or use case requires more control over the prompt, you may want to use a different library (ideally query the LLM models directly). -- **Inference Optimization**. 
For hosted models (GPT-4, PalM, Cohere) etc, this library provides an excellent unified interface as the hosted api already takes care of inference optimizations. However, if you are looking for a library that is optimized for inference with **_local models_(e.g., huggingface)** (tensor parrelization, distributed inference etc), I'd recommend looking at [vllm](https://github.com/vllm-project/vllm) or [tgi](https://github.com/huggingface/text-generation-inference).

## Citation

From 501a229f50bed6e710e72146b2cc65f8db349a94 Mon Sep 17 00:00:00 2001
From: Sunish B
Date: Mon, 24 Nov 2025 22:45:15 +0100
Subject: [PATCH 2/4] Added support for ollama extension.

---
 llmx/generators/text/ollama_textgen.py | 76 ++++++++++++++++++++++++++
 llmx/generators/text/textgen.py        | 13 ++++-
 tests/test_generators.py               | 14 +++++
 3 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 llmx/generators/text/ollama_textgen.py

diff --git a/llmx/generators/text/ollama_textgen.py b/llmx/generators/text/ollama_textgen.py
new file mode 100644
index 0000000..f7c2df7
--- /dev/null
+++ b/llmx/generators/text/ollama_textgen.py
@@ -0,0 +1,76 @@
+from typing import Union, List, Dict
+from .base_textgen import TextGenerator
+from ...datamodel import Message, TextGenerationConfig, TextGenerationResponse
+from ...utils import cache_request, get_models_maxtoken_dict, num_tokens_from_messages
+import os
+import ollama
+import warnings
+from dataclasses import asdict
+
+
+class OllamaTextGenerator(TextGenerator):
+    def __init__(
+        self,
+        api_key: str = os.environ.get("OLLAMA_API_KEY", None),
+        provider: str = "ollama",
+        organization: str = None,
+        model: str = None,
+        model_name: str = None,
+        models: Dict = None,
+    ):
+        super().__init__(provider=provider)
+        self.api_key = api_key or os.environ.get("OLLAMA_API_KEY", None)
+
+        #if self.api_key is None:
+        #    warnings.warn(
+        #        "Ollama API key is not set. Please set the OPENAI_API_KEY environment variable."
+        #    )
+
+        self.model_name = model_name or "llama3.2:3b"
+        self.model_max_token_dict = get_models_maxtoken_dict(models)
+
+        for key,value in self.model_max_token_dict.items():
+            print(f"{key}: {value}")
+
+
+    def generate(
+        self,
+        messages: Union[List[dict], str],
+        config: TextGenerationConfig = TextGenerationConfig(),
+        **kwargs,
+    ) -> TextGenerationResponse:
+        use_cache = config.use_cache
+        model = config.model or self.model_name
+
+        ollama_config = {
+            "model": self.model_name,
+            "prompt": messages,
+            "temperature": config.temperature,
+            "k": config.top_k,
+            "p": config.top_p,
+            "num_generations": config.n,
+            "stop_sequences": config.stop,
+        }
+        cache_key_params = ollama_config | {"messages": messages}
+
+        if use_cache:
+            response = cache_request(cache=self.cache, params=cache_key_params)
+            if response:
+                print("****** Using Cache ******")
+                return TextGenerationResponse(**response)
+
+
+        response = ollama.chat(model=model, messages=messages)
+        response_gen = TextGenerationResponse(
+            text=[Message(role=response.message.role, content=response.message.content)],
+            config=ollama_config
+        )
+        cache_request(
+            cache=self.cache, params=cache_key_params, values=asdict(response_gen)
+        )
+        return response_gen
+
+
+    def count_tokens(self, text) -> int:
+        numtk = num_tokens_from_messages(text)
+        return num_tokens_from_messages(text)
\ No newline at end of file
diff --git a/llmx/generators/text/textgen.py b/llmx/generators/text/textgen.py
index 3d86002..9a65797 100644
--- a/llmx/generators/text/textgen.py
+++ b/llmx/generators/text/textgen.py
@@ -3,6 +3,7 @@
 from .palm_textgen import PalmTextGenerator
 from .cohere_textgen import CohereTextGenerator
 from .anthropic_textgen import AnthropicTextGenerator
+from .ollama_textgen import OllamaTextGenerator
 import logging

 logger = logging.getLogger("llmx")
@@ -19,9 +20,11 @@ def sanitize_provider(provider: str):
         return "hf"
     elif provider.lower() == "anthropic" or provider.lower() == "claude":
         return "anthropic"
+    elif provider.lower() == "ollama":
+        return "ollama"
     else:
         raise ValueError(
-            f"Invalid provider '{provider}'. Supported providers are 'openai', 'hf', 'palm', 'cohere', and 'anthropic'."
+            f"Invalid provider '{provider}'. Supported providers are 'openai', 'hf', 'palm', 'cohere', 'anthropic', and 'ollama'."
         )
@@ -58,6 +61,14 @@ def llm(provider: str = None, **kwargs):
         return CohereTextGenerator(**kwargs)
     elif provider.lower() == "anthropic":
         return AnthropicTextGenerator(**kwargs)
+    elif provider.lower() == "ollama":
+        try:
+            import ollama
+        except ImportError:
+            raise ImportError(
+                "Please install the `ollama` package to use the OllamaTextGenerator class. pip install ollama"
+            )
+        return OllamaTextGenerator(**kwargs)
     elif provider.lower() == "hf":
         try:
             import transformers
diff --git a/tests/test_generators.py b/tests/test_generators.py
index 4f4e59c..6b1664c 100644
--- a/tests/test_generators.py
+++ b/tests/test_generators.py
@@ -74,3 +74,17 @@ def test_hf_local():
     assert ("paris" in answer.lower())
     assert len(hf_local_response.text) == 2
+
+
+def test_ollama_local():
+    ollama_local_gen = llm(
+        provider="ollama",
+        model="llama3.2:3b",
+        model_name="llama3.2:3b"
+    )
+    ollama_local_response = ollama_local_gen.generate(messages, config=config)
+    answer = ollama_local_response.text[0].content
+    assert ("paris" in answer.lower())
+
+if __name__ == "__main__":
+    test_ollama_local()

From c98a74f2618fb2febf91e67fe115e4bdcca5d33f Mon Sep 17 00:00:00 2001
From: Sunish B
Date: Thu, 27 Nov 2025 13:16:06 +0100
Subject: [PATCH 3/4] Handle missing ollama server exception

---
 llmx/generators/text/ollama_textgen.py | 32 ++++++++++++++++----------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/llmx/generators/text/ollama_textgen.py b/llmx/generators/text/ollama_textgen.py
index f7c2df7..a723a25 100644
--- a/llmx/generators/text/ollama_textgen.py
+++ b/llmx/generators/text/ollama_textgen.py
@@ -4,29 +4,28 @@
 from ...utils import cache_request, get_models_maxtoken_dict, num_tokens_from_messages
 import os
 import ollama
-import warnings
+import warnings, requests, logging
 from dataclasses import asdict


 class OllamaTextGenerator(TextGenerator):
     def __init__(
         self,
-        api_key: str = os.environ.get("OLLAMA_API_KEY", None),
         provider: str = "ollama",
-        organization: str = None,
+        host: str = "http://localhost:11434",
         model: str = None,
         model_name: str = None,
         models: Dict = None,
     ):
         super().__init__(provider=provider)
-        self.api_key = api_key or os.environ.get("OLLAMA_API_KEY", None)
+        self.host = host

-        #if self.api_key is None:
-        #    warnings.warn(
-        #        "Ollama API key is not set. Please set the OPENAI_API_KEY environment variable."
-        #    )
+        if not self.is_ollama_running():
+            raise RuntimeError(
+                "Ollama is not running. Please start it ('ollama serve') and make sure the port is reachable."
+            )

-        self.model_name = model_name or "llama3.2:3b"
+        self.model_name = model_name or "llama3.1:8b"
         self.model_max_token_dict = get_models_maxtoken_dict(models)

         for key,value in self.model_max_token_dict.items():
             print(f"{key}: {value}")
@@ -42,6 +41,8 @@ def generate(
         use_cache = config.use_cache
         model = config.model or self.model_name

+        # Hack to keep descriptions filled
+        messages[0]["content"] += "Always fill the description fields."
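+        # Review note: the line above appends to the first message in the caller's
+        # `messages` list in place; if that mutation is not intended, copying the
+        # messages before modifying them may be safer.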
         ollama_config = {
             "model": self.model_name,
             "prompt": messages,
@@ -49,14 +50,13 @@
             "k": config.top_k,
             "p": config.top_p,
             "num_generations": config.n,
-            "stop_sequences": config.stop,
         }
         cache_key_params = ollama_config | {"messages": messages}

         if use_cache:
             response = cache_request(cache=self.cache, params=cache_key_params)
             if response:
-                print("****** Using Cache ******")
+                logging.warning("****** Using Cache ******")
                 return TextGenerationResponse(**response)

@@ -69,7 +69,15 @@
             cache=self.cache, params=cache_key_params, values=asdict(response_gen)
         )
         return response_gen
-
+
+    def is_ollama_running(self) -> bool:
+        try:
+            requests.get(self.host, timeout=2)
+            return True
+        except requests.exceptions.ConnectionError:
+            return False
+        except requests.exceptions.Timeout:
+            return False

     def count_tokens(self, text) -> int:
         numtk = num_tokens_from_messages(text)

From 81d801b6a50d52f8509108cd30c70ecda39a2e4d Mon Sep 17 00:00:00 2001
From: Sunish B
Date: Thu, 27 Nov 2025 13:27:00 +0100
Subject: [PATCH 4/4] sync README with main

---
 README.md | 54 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index e451564..9784ae4 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 # LLMX with Ollama extension
-This repository is a fork of llmx with added support for running Ollama models locally.
+This repository is a fork of [llmx](https://github.com/victordibia/llmx) with added support for running Ollama models locally.
 It extends llmx by integrating locally hosted Ollama models and their execution features.
 You can install this fork directly from the GitHub repository using pip.

@@ -8,18 +8,48 @@
 Contributions and feedback are welcome to further improve Ollama compatibility.

+## Prerequisite: a local Ollama setup
+A working local Ollama setup must be installed and running on your machine before using this fork.
-## Citation
+Go to the official Ollama website (https://ollama.com) and download the installer.
+After installation, verify it by running the following command from the command line:
+```
+ollama -v
+
-If you use this library in your work, please cite:
+To list available models:
+```
+ollama list
+
+
-```bibtex
-@software{victordibiallmx,
-author = {Victor Dibia},
-license = {MIT},
-month = {10},
-title = {LLMX - An API for Chat Fine-Tuned Language Models},
-url = {https://github.com/victordibia/llmx},
-year = {2023}
-}
+To download and run a model, e.g. llama3.2:3b:
+```
+ollama run llama3.2:3b
+
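+
+A model can also be downloaded without starting an interactive session (a minimal sketch using Ollama's pull command):
+```
+ollama pull llama3.2:3b
+```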
+
+## Testing the llmx Ollama extension
+```
+python .\tests\test_generators.py
+
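+
+To run only the Ollama test, pytest's keyword filter can be used instead (a sketch, assuming pytest is installed):
+```
+pytest tests/test_generators.py -k ollama
+```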
+
+## Usage
+```python
+from llmx import llm
+from llmx.datamodel import TextGenerationConfig
+
+# Define your messages and config as needed
+messages = [
+    {"role": "user", "content": "What is the capital city of Germany?"}
+]
+
+config = TextGenerationConfig(
+    temperature=0.4,
+    use_cache=False
+)
+
+ollama_gen = llm(provider="ollama", model_name="llama3.2:3b")
+response = ollama_gen.generate(messages, config=config)
+answer = response.text[0].content
+
+print("Answer:", answer)
 ```
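+
+## Installation
+The fork's exact repository URL is not pinned above; a minimal sketch of installing it with pip, assuming the fork lives at github.com/<your-username>/llmx (replace with the actual path):
+```
+pip install git+https://github.com/<your-username>/llmx.git
+pip install ollama
+```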