In this tutorial, we implement a fully functional Ollama setup inside Google Colab to replicate a self-hosted LLM workflow. We begin by installing Ollama directly on the Colab VM using the official Linux installer and then launch the Ollama server in the background to expose its HTTP API on localhost:11434. After verifying the service, we pull lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, which balance resource constraints with usability in a CPU-only environment. To interact with these models programmatically, we use the /api/chat endpoint via Python's requests module with streaming enabled, which lets us capture token-level output incrementally. Finally, we layer a Gradio-based UI on top of this client so we can issue prompts, maintain multi-turn history, configure parameters such as temperature and context size, and inspect results in real time. Check out the Full Codes here.
import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
    """Run a shell command and stream its output."""
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="")
    p.wait()
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")

if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
    print("🔧 Installing Ollama ...")
    sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
    print("✅ Ollama already installed.")

try:
    import gradio
except Exception:
    print("🔧 Installing Gradio ...")
    sh("pip -q install gradio==4.44.0")
We first check whether Ollama is already installed on the system and, if not, install it using the official script. We also make sure Gradio is available by importing it, or installing the required version when it is missing. This prepares our Colab environment to run the chat interface smoothly.
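As an optional sanity check (a minimal sketch, not part of the original walkthrough), we can confirm that the ollama binary is actually on the PATH before moving on, reusing the sh helper defined above:

sh("ollama --version", check=False)  # check=False: just report the version, don't stop the notebook if this fails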
def start_ollama():
    try:
        requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
        print("✅ Ollama server already running.")
        return None
    except Exception:
        pass
    print("🚀 Starting Ollama server ...")
    proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for _ in range(60):
        time.sleep(1)
        try:
            r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
            if r.ok:
                print("✅ Ollama server is up.")
                break
        except Exception:
            pass
    else:
        raise RuntimeError("Ollama did not start in time.")
    return proc

server_proc = start_ollama()
We start the Ollama server in the background and poll its health endpoint until it responds successfully. This ensures the server is running and ready before we send any API requests.
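If we want an extra confirmation beyond the /api/tags probe above, a small sketch like the following queries Ollama's /api/version endpoint (assuming the default localhost:11434 address) and prints the reported server version:

try:
    info = requests.get("http://127.0.0.1:11434/api/version", timeout=2).json()
    print("Ollama server version:", info.get("version"))
except Exception as e:
    print("Server not reachable:", e)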
MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"🧠 Using model: {MODEL}")

try:
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    have = any(m.get("name") == MODEL for m in tags.get("models", []))
except Exception:
    have = False

if not have:
    print(f"⬇️ Pulling model {MODEL} (first time only) ...")
    sh(f"ollama pull {MODEL}")
We define the default model, check whether it is already available on the Ollama server, and pull it automatically if it is not. This guarantees the selected model is ready before we start any chat sessions.
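To see what is already on disk, a small illustrative helper (the name list_local_models is ours, not from the original code) can reuse the same /api/tags endpoint:

def list_local_models():
    """Return the names of models the local Ollama server has already pulled."""
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    return [m.get("name") for m in tags.get("models", [])]

print(list_local_models())  # e.g. ['qwen2.5:0.5b-instruct']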
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
    """Yield streaming text chunks from the Ollama /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {"temperature": float(temperature)},
    }
    if num_ctx:
        payload["options"]["num_ctx"] = int(num_ctx)
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))
            if "message" in data and "content" in data["message"]:
                yield data["message"]["content"]
            if data.get("done"):
                break
We create a streaming client for the Ollama /api/chat endpoint, sending messages as JSON payloads and yielding tokens as they arrive. This lets us handle responses incrementally, so we see the model's output in real time instead of waiting for the full completion.
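As a usage sketch, the generator can be consumed directly to assemble a full reply. For cases where streaming is unnecessary, a non-streaming variant (our own illustrative helper, ollama_chat_once) can set "stream": False, in which case /api/chat returns a single JSON object whose message.content holds the whole completion:

# Collect the streamed chunks into one string.
reply = "".join(ollama_chat_stream(
    [{"role": "user", "content": "Say hello in five words."}],
    temperature=0.0,
))
print(reply)

def ollama_chat_once(messages, model=MODEL, temperature=0.2):
    """Non-streaming variant: one request, one JSON response (illustrative helper)."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"temperature": float(temperature)},
    }
    r = requests.post(OLLAMA_URL, json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["message"]["content"]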
def smoke_test():
    print("\n🧪 Smoke test:")
    sys_msg = {"role": "system", "content": "You are concise. Use short bullets."}
    user_msg = {"role": "user", "content": "Give 3 quick tips to sleep better."}
    out = []
    for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
        print(chunk, end="")
        out.append(chunk)
    print("\n🧪 Done.\n")

try:
    smoke_test()
except Exception as e:
    print("⚠️ Smoke test skipped:", e)
We run a quick smoke test by sending a simple prompt through our streaming client to confirm that the model responds correctly. This verifies that Ollama is installed, the server is running, and the selected model works before we build the full chat UI.
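Because the model name is read from the OLLAMA_MODEL environment variable, switching to the other lightweight model mentioned above (llama3.2:1b) only requires setting that variable before the model-selection cell runs, for example:

os.environ["OLLAMA_MODEL"] = "llama3.2:1b"  # then re-run the model selection and pull cell above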
import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when useful."

def chat_fn(message, history, temperature, num_ctx):
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for u, a in history:
        if u: msgs.append({"role": "user", "content": u})
        if a: msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": message})
    acc = ""
    try:
        for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
            acc += part
            yield acc
    except Exception as e:
        yield f"⚠️ Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
    gr.Markdown("# 🦙 Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
    with gr.Row():
        temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
        num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
    chat = gr.Chatbot(height=460)
    msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
    clear = gr.Button("Clear")

    def user_send(m, h):
        m = (m or "").strip()
        if not m:
            return "", h
        return "", h + [[m, None]]

    def bot_reply(h, temperature, num_ctx):
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        acc = ""
        for partial in stream:
            acc = partial
            h[-1][1] = acc
            yield h

    msg.submit(user_send, [msg, chat], [msg, chat]).then(bot_reply, [chat, temp, num_ctx], [chat])
    clear.click(lambda: None, None, chat)

print("🌐 Launching Gradio ...")
demo.launch(share=True)
We integrate Gradio to build an interactive chat UI on top of the Ollama server, converting user input and conversation history into the correct message format and streaming back model responses. The sliders let us adjust parameters such as temperature and context length, while the chat box and clear button provide a simple, real-time interface for testing different prompts.
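When we are done experimenting, a small optional cleanup sketch (assuming the demo and server_proc variables from the cells above are still in scope) stops the Gradio app and the background Ollama process:

demo.close()                 # stop the Gradio app and free the port
if server_proc is not None:  # start_ollama() returned None if the server was already running
    server_proc.terminate()  # stop the background `ollama serve` process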
In conclusion, we establish a reproducible pipeline for running Ollama in Colab: installation, server startup, model management, API access, and user-interface integration. The system uses Ollama's REST API as the core interaction layer, providing both command-line and Python streaming access, while Gradio handles session persistence and chat rendering. This approach preserves the "self-hosted" design described in the original guide but adapts it for Colab's constraints, where Docker and GPU-backed Ollama images are not practical. The result is a compact yet technically complete framework that lets us experiment with multiple LLMs, adjust generation parameters dynamically, and test conversational AI locally within a notebook environment.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

