LlamaIndex + Gonka AI — RAG applications for pennies

LlamaIndex is a leading framework for building RAG applications and AI agents in Python (there is also a TypeScript version, LlamaIndex.TS). It handles document loading, chunking, indexing, vector search, and response synthesis—you describe the data, and LlamaIndex turns it into a question-answering system on top of any LLM.

The only problem is the cost of inference. RAG is inherently resource-intensive: every question sends the query plus several retrieved context fragments to the model, and indexing large collections adds embedding calls on top. At production scale, this means thousands of requests per day. With OpenAI ($2.50–$15 per 1M tokens) or Anthropic ($3–$15 per 1M), even a modest Q&A service costs hundreds to thousands of dollars per month.

LlamaIndex works natively with any OpenAI-compatible endpoint through the OpenAILike class, so JoinGonka Gateway can be connected in just a few lines, without custom providers or patches. The result: the same RAG system runs for $0.003/1M input tokens (output ×3) via the decentralized Gonka network, hundreds to thousands of times cheaper than cloud APIs.

Quick Start: Connecting via OpenAILike

JoinGonka API key: register at gate.joingonka.ai/register — we give you 10M free tokens to get started — and create a jg-xxx key in the Dashboard.

Installation:

pip install llama-index llama-index-llms-openai-like

For any arbitrary OpenAI-compatible API, LlamaIndex provides the OpenAILike class from the llama_index.llms.openai_like package. A minimal request example to Gonka:

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://gate.joingonka.ai/v1",
    api_key="jg-your-key",
    model="MiniMaxAI/MiniMax-M2.7",
    is_chat_model=True,            # Gonka is a chat endpoint
    is_function_calling_model=True, # native tool calling is supported
    context_window=200000,         # 200K for network models
    max_tokens=8192,               # output ceiling via Gateway
)

response = llm.complete("Explain what RAG is in three sentences.")
print(response)

Important note on OpenAILike: be sure to specify is_chat_model=True. Otherwise, LlamaIndex will attempt to use the completion endpoint, which the Gateway does not provide. is_function_calling_model=True enables native tool calls. Set context_window according to the model so that LlamaIndex truncates context correctly.
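A minimal sketch of keeping the key out of source code: the settings from the note above are collected in one helper, and the key is read from an environment variable. The variable name JOINGONKA_API_KEY and the helper gonka_llm_kwargs are hypothetical names chosen for this illustration; neither the Gateway nor LlamaIndex defines them.

```python
import os


def gonka_llm_kwargs(model: str = "MiniMaxAI/MiniMax-M2.7") -> dict:
    """Collect the OpenAILike keyword arguments for the Gateway in one place."""
    # JOINGONKA_API_KEY is a hypothetical variable name used only in this sketch.
    api_key = os.environ.get("JOINGONKA_API_KEY")
    if not api_key:
        raise RuntimeError("Set JOINGONKA_API_KEY to your jg-... key first")
    return {
        "api_base": "https://gate.joingonka.ai/v1",
        "api_key": api_key,
        "model": model,
        "is_chat_model": True,              # chat endpoint only
        "is_function_calling_model": True,  # native tool calling
        "context_window": 200000,
        "max_tokens": 8192,
    }


# Usage (requires llama-index-llms-openai-like to be installed):
#   from llama_index.llms.openai_like import OpenAILike
#   llm = OpenAILike(**gonka_llm_kwargs())
```

Keeping the arguments in a plain dict also makes it easy to reuse the same settings for Settings.llm and for agents.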

Example: RAG pipeline with query engine

A classic LlamaIndex scenario is an index over your documents and queries to it via query_engine. The global LLM is set once via Settings.llm, and the entire pipeline will use Gonka automatically.

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# 1. LLM via Gonka (global setting)
Settings.llm = OpenAILike(
    api_base="https://gate.joingonka.ai/v1",
    api_key="jg-your-key",
    model="MiniMaxAI/MiniMax-M2.7",
    is_chat_model=True,
    context_window=200000,
    max_tokens=8192,
)

# 2. Local embeddings (free, no OpenAI)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# 3. Load and index documents from the ./data directory
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 4. Query the knowledge base
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(response)

A critical nuance regarding embeddings: by default, VectorStoreIndex uses OpenAI embeddings (text-embedding-ada-002). These are separate paid calls to OpenAI, not Gonka. To avoid OpenAI entirely, set a local embedding model via Settings.embed_model, as in the example above (HuggingFaceEmbedding, installed with pip install llama-index-embeddings-huggingface). Generation then runs through Gonka, while vectorization happens locally and for free.

Cost: a single RAG pipeline request (search + generation) consumes ~2–5K LLM tokens. Via Gonka, this is a small fraction of a cent; via OpenAI/Anthropic, it is roughly three orders of magnitude more expensive. At a throughput of thousands of requests per day, the difference adds up to hundreds or thousands of dollars in savings per month.
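A back-of-the-envelope sketch of where a figure in the ~2–5K range can come from. It assumes LlamaIndex's usual default retrieval settings (two retrieved chunks of up to 1024 tokens each; check your installed version), while the question, prompt-template, and answer sizes are illustrative guesses rather than measurements.

```python
def rag_request_tokens(
    question_tokens: int = 50,   # illustrative guess
    top_k: int = 2,              # assumed retriever default (similarity_top_k)
    chunk_tokens: int = 1024,    # assumed default chunk size (upper bound per chunk)
    template_tokens: int = 150,  # illustrative guess for the prompt template
    answer_tokens: int = 500,    # illustrative guess for the generated answer
) -> dict:
    """Estimate the LLM tokens consumed by one query-engine call."""
    input_tokens = question_tokens + template_tokens + top_k * chunk_tokens
    return {
        "input": input_tokens,
        "output": answer_tokens,
        "total": input_tokens + answer_tokens,
    }


budget = rag_request_tokens()
print(budget)                       # total with two retrieved chunks
print(rag_request_tokens(top_k=4))  # total with four retrieved chunks
```

Raising top_k or the chunk size increases the input side roughly linearly, which is why retrieval settings dominate the per-request cost.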

Comparison of RAG Load Costs

A RAG application is not a one-time chat but a continuous stream of requests: every user question pulls in 2–5K LLM tokens (the question itself plus the retrieved context fragments). Let's calculate typical volumes and costs across different providers. Gonka prices via JoinGonka Gateway: input ~$0.003/1M, output ×3.

Scenario	LLM Tokens	OpenAI / Anthropic	Gonka via JoinGonka
One question to knowledge base	~4K	$0.01–$0.06	~$0.00002
Support chatbot (1K requests/day)	~4M/day	$10–$60 per day	~$0.019 per day
Indexing + Q&A (1M words)	~5M	$12–$75	~$0.024
Production service, 50K requests/mo	~200M/mo	$500–$3,000 per mo	~$0.96 per mo
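The Gonka column can be reproduced with a few lines of arithmetic. This is a sketch under one assumption that the text does not state: about 70% of the tokens are input and 30% are output, which is the split that makes the table's figures come out at the quoted prices ($0.003/1M input, output ×3). The function name gonka_cost is hypothetical.

```python
INPUT_PRICE_PER_M = 0.003                   # USD per 1M input tokens (from the text)
OUTPUT_PRICE_PER_M = INPUT_PRICE_PER_M * 3  # "output x3"
INPUT_SHARE = 0.7                           # assumption: 70% input / 30% output


def gonka_cost(total_tokens: int, input_share: float = INPUT_SHARE) -> float:
    """Estimated USD cost for a token volume at the quoted Gateway prices."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    usd = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return usd / 1_000_000


scenarios = {
    "One question (~4K)": 4_000,
    "Support chatbot (~4M/day)": 4_000_000,
    "Indexing + Q&A (~5M)": 5_000_000,
    "Production service (~200M/mo)": 200_000_000,
}

for name, tokens in scenarios.items():
    print(f"{name}: ${gonka_cost(tokens):.5f}")
```

If your workload is more output-heavy (long generated answers), pass a lower input_share to see how the estimate shifts.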

With 10M free tokens, you can debug your entire RAG pipeline, index a test corpus, and run thousands of queries—all without spending a cent. At production scale, JoinGonka Gateway turns RAG from an expensive service into a negligible expense line item.

Agents, tool calling and model selection

LlamaIndex can not only answer questions over documents but also build agents with tools. Both models listed in the table below support native tool calling, so agents call functions in a structured manner without text parsing. Example of an agent with a tool:

import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://gate.joingonka.ai/v1",
    api_key="jg-your-key",
    model="MiniMaxAI/MiniMax-M2.7",
    is_chat_model=True,
    is_function_calling_model=True,
    context_window=200000,
    max_tokens=8192,
)

def multiply(a: float, b: float) -> float:
    """Multiplies two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=llm,
    system_prompt="You are a helpful assistant. Use tools for calculations.",
)

async def main():
    result = await agent.run("How much is 1234 multiplied by 5678?")
    print(result)

asyncio.run(main())
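To make "structured" concrete, here is a sketch of what a native tool call looks like in the OpenAI-compatible chat format and how it maps onto the multiply function. FunctionAgent performs this step for you; the dispatch helper and the hand-written sample payload below are illustrative only, not LlamaIndex internals or a recorded Gateway response.

```python
import json


def multiply(a: float, b: float) -> float:
    """Multiplies two numbers."""
    return a * b


# Registry of callable tools, keyed by the name the model refers to.
TOOLS = {"multiply": multiply}

# Hand-written sample in the OpenAI chat-completions tool_calls shape:
# the arguments arrive as a JSON string, not as free text to be parsed.
sample_tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "multiply",
        "arguments": json.dumps({"a": 1234, "b": 5678}),
    },
}


def dispatch(tool_call: dict):
    """Look up the requested tool and call it with the decoded arguments."""
    fn = TOOLS[tool_call["function"]["name"]]
    kwargs = json.loads(tool_call["function"]["arguments"])
    return fn(**kwargs)


print(dispatch(sample_tool_call))  # 7006652
```

Because the function name and arguments are machine-readable fields, no regular expressions or prompt-specific parsing are needed to execute the call.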

Model selection (model field and corresponding context_window / max_tokens limits):

Model (`model`)	Context	Max Output	Use Case
`moonshotai/Kimi-K2.6`	200K	8192	Default: strong reasoning, tool calling, agents
`MiniMaxAI/MiniMax-M2.7`	200K	8192	RAG, long context, long answers

The max_tokens limit via Gateway is up to 8192 for all network models. If max_tokens is not specified for a non-streaming request, it defaults to 1500 tokens — for RAG responses and agent steps, set this value explicitly.
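A small sketch for sanity-checking these limits before constructing the LLM object. The constants come from the text above; the helper name prompt_budget is hypothetical, and the calculation assumes the context window is shared between the prompt and the generated output, as is usual for chat models.

```python
from typing import Optional

GATEWAY_MAX_OUTPUT = 8192      # output ceiling via Gateway (from the text)
GATEWAY_DEFAULT_OUTPUT = 1500  # used when max_tokens is omitted (non-streaming)
CONTEXT_WINDOW = 200_000       # context size listed for both models


def prompt_budget(max_tokens: Optional[int] = None) -> int:
    """Tokens left for the question and retrieved context after reserving output."""
    output = GATEWAY_DEFAULT_OUTPUT if max_tokens is None else max_tokens
    if not 0 < output <= GATEWAY_MAX_OUTPUT:
        # Local validation only; how the Gateway itself reacts is not specified here.
        raise ValueError(f"max_tokens must be between 1 and {GATEWAY_MAX_OUTPUT}")
    return CONTEXT_WINDOW - output


print(prompt_budget(8192))  # 191808
print(prompt_budget())      # 198500
```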

TypeScript: LlamaIndex.TS offers an equivalent route. The OpenAI class from the @llamaindex/openai package accepts baseURL and apiKey (or reads the OPENAI_BASE_URL / OPENAI_API_KEY environment variables), so the same Gateway can be connected from a Node.js stack. If you are building AI applications using Python frameworks, also check out our LangChain guide.

LlamaIndex + Gonka = production-ready RAG and agents for a fraction of a cent. Connect via OpenAILike (is_chat_model=True), native tool calling, and local embeddings—input $0.003/1M instead of $2.50–$15 with OpenAI. 10M free tokens are enough to debug your entire pipeline.
