Knowledge Base Sections ▾

Tools

Tools

LlamaIndex + Gonka AI — RAG applications for pennies

LlamaIndex is a leading framework for building RAG applications and AI agents in Python (a TypeScript version, LlamaIndex.TS, is also available). It handles document loading, chunking, indexing, vector search, and response assembly—you describe the data, and LlamaIndex transforms it into a question-answering system on top of any LLM.

There's one problem—the cost of inference. RAG by its nature is resource-intensive: for each question, the query plus several found context fragments go into the model, and for indexing large collections, embeddings are added. At production volumes, this means thousands of requests per day. With OpenAI ($2.50–15 per 1M tokens) or Anthropic ($3–15 per 1M), even a modest Q&A service turns into tens of thousands of dollars per month.

LlamaIndex natively works with any OpenAI-compatible endpoint via the OpenAILike class. This means that JoinGonka Gateway connects with a few lines of code—without custom providers or patches. The result: the same RAG system works for $0.0005/1M input tokens (output ×3) through the decentralized Gonka network—hundreds to thousands of times cheaper than cloud APIs.

Quick Start: Connecting via OpenAILike

JoinGonka API key: register at gate.joingonka.ai/register—we provide 10M free tokens to start—and create a jg-xxx key in the Dashboard.

Installation:

pip install llama-index llama-index-llms-openai-like

For an arbitrary OpenAI-compatible API, LlamaIndex provides the OpenAILike class from the llama_index.llms.openai_like package. A minimal example of a request to Gonka:

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://gate.joingonka.ai/v1",
    api_key="jg-your-key",
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    is_chat_model=True,            # Gonka is a chat endpoint
    is_function_calling_model=True, # native tool calling is supported
    context_window=131072,         # 128K for Qwen3-235B
    max_tokens=8192,               # output ceiling via Gateway (Qwen)
)

response = llm.complete("Explain what RAG is in three sentences.")
print(response)

Important about OpenAILike: be sure to specify is_chat_model=True—otherwise, LlamaIndex will go to the completion endpoint, which we don't have. is_function_calling_model=True enables native tool calls. Set context_window according to the model so LlamaIndex correctly handles context.

Example: RAG pipeline with query engine

A classic LlamaIndex scenario is an index over your documents and queries to it via query_engine. The global LLM is set once via Settings.llm, then the entire pipeline automatically uses Gonka.

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# 1. LLM via Gonka (once — globally)
Settings.llm = OpenAILike(
    api_base="https://gate.joingonka.ai/v1",
    api_key="jg-your-key",
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    is_chat_model=True,
    context_window=131072,
    max_tokens=8192,
)

# 2. Local embeddings (free, without OpenAI)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# 3. Loading and indexing documents from the ./data folder
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 4. Querying the knowledge base
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
print(response)

Critical nuance about embeddings: by default, VectorStoreIndex uses OpenAI embeddings (text-embedding-ada-002)—these are separate paid calls to OpenAI, not to Gonka. To completely move away from OpenAI, set a local embedding model via Settings.embed_model (as in the example above—HuggingFaceEmbedding, package pip install llama-index-embeddings-huggingface). Then generation goes through Gonka, and vectorization is local and free.

Cost: one RAG pipeline query (search + generation) consumes ~2–5K LLM tokens. Through Gonka, this is fractions of a cent; through OpenAI/Anthropic, it's 3–4 orders of magnitude more expensive. For a stream of thousands of queries per day, the difference turns into tens of thousands of dollars in monthly savings.

Comparison of RAG Load Costs

A RAG application is not a one-time chat but a continuous stream of requests: each user question pulls ~2–5K LLM tokens (the question itself plus found context fragments). Let's calculate typical volumes and what they cost on different providers. Gonka prices via JoinGonka Gateway: input ~$0.0005/1M, output ×3.

ScenarioLLM TokensOpenAI / AnthropicJoinGonka Gonka
One question to the knowledge base~4K$0.01 — $0.06~$0.000005
Support chatbot (1K queries/day)~4M/day$10 — $60 per day~$0.005 per day
Indexing + Q&A on corpus (1M words)~5M$12 — $75~$0.006
Production service, 50K queries/month~200M/month$500 — $3,000 per month~$0.25 per month

With 10M free tokens, you can debug the entire RAG pipeline, index a test corpus, and run thousands of queries—without spending a cent. At production volumes, JoinGonka Gateway turns RAG from an expensive service into an expense item you might not even notice.

Agents, tool calling and model selection

LlamaIndex can not only answer based on documents but also build agents with tools. All three Gonka models support native tool calling—agents call functions structurally, without text parsing. Example of an agent with a tool:

import asyncio
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://gate.joingonka.ai/v1",
    api_key="jg-your-key",
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    is_chat_model=True,
    is_function_calling_model=True,
    context_window=131072,
    max_tokens=8192,
)

def multiply(a: float, b: float) -> float:
    """Multiplies two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=llm,
    system_prompt="You are a helpful assistant. Use tools for calculations.",
)

async def main():
    result = await agent.run("What is 1234 multiplied by 5678?")
    print(result)

asyncio.run(main())

Model selection (model field and corresponding context_window / max_tokens limits):

Model (model)ContextMax OutputWhen to use
Qwen/Qwen3-235B-A22B-Instruct-2507-FP8128K8192Default: RAG, agents, long answers
moonshotai/Kimi-K2.6128K3072Strong reasoning and tool calling
MiniMaxAI/MiniMax-M2.7128K4096Alternative for agent tasks

The max_tokens limit via Gateway is up to 8192 for Qwen3; for Kimi and MiniMax, specify 3072 and 4096 respectively. If max_tokens is not specified for a non-streaming request, up to 1500 tokens will be returned by default—for RAG answers and agent steps, set the value explicitly.

TypeScript: For LlamaIndex.TS, there's a mirroring path—the OpenAI class from the @llamaindex/openai package accepts baseURL and apiKey (or reads OPENAI_BASE_URL / OPENAI_API_KEY variables), so the same Gateway connects in the Node.js stack. If you're building AI applications and on Python frameworks, also check out the guide for LangChain.

LlamaIndex + Gonka = production-ready RAG and agents for fractions of a cent. Connection via OpenAILike (is_chat_model=True), native tool calling, local embeddings — input $0.0005/1M instead of $2.50–15 for OpenAI. 10M free tokens are enough to debug the entire pipeline.

Want to learn more?

Explore other sections or start earning GNK right now.

Get 10M free tokens →