How to Build a RAG App for a Niche Market — Lessons from Three Production Systems

I have built multiple RAG systems in production across different markets and languages. Before going into technical details, let me show you what I mean by "niche market RAG" and why it is worth building.

houkaitei.gertech.jp: AI assistant for Japanese pharmacists, grounded in Japanese government pharmaceutical regulations and the complete national drug database
huuli.tech: AI legal assistant for Mongolian lawyers, grounded in all Mongolian laws and court rulings
roppolab.jp: AI study assistant for Japanese bar exam students, covering all Japanese laws, court rulings, and academic commentary

These are not demos. They are paid products used by real professionals every day.

This article is everything I wish someone had told me before I built the first one.

The Core Insight: ChatGPT Is Your Benchmark, Not Your Competition

ChatGPT's free tier is genuinely impressive. As a solo developer, competing on breadth is a losing game. You will never out-general OpenAI.

But for a Mongolian mining lawyer, or a Japanese hospital pharmacist, ChatGPT is mediocre. It does not know the specific regulation. It cannot cite the exact court ruling. It confabulates when pushed on narrow domain specifics.

Your advantage is precisely that you are narrow.

A RAG system built on the complete corpus of Mongolian law returns more reliable answers than ChatGPT for Mongolian legal questions — full stop. It cites the actual law. The user can click the citation and verify. That trust is worth paying for.

This is the business case: better than ChatGPT for one specific thing, with citations, in the language of the domain.

What Is RAG?

RAG stands for Retrieval-Augmented Generation. Instead of asking an LLM to answer from memory, you:

Store your domain documents as searchable vectors
At query time, find the most relevant passages
Give those passages to the LLM as context
Ask the LLM to answer only from what you gave it, with citations

That is the whole system. The data pipeline runs once (and then incrementally). The application runs on every query.
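
To make the four steps concrete, here is a minimal, dependency-free sketch. It is an illustration only: the bag-of-words `toy_embed` stands in for a real embedding model, the three documents are made-up toy data, and the names `toy_embed`, `retrieve`, and `build_prompt` are hypothetical rather than part of any library.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Step 2: return the k passages most similar to the question."""
    q = toy_embed(question)
    return sorted(docs, key=lambda d: cosine(q, toy_embed(d)), reverse=True)[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Steps 3 and 4: hand the passages to the LLM and demand citations."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer ONLY from the excerpts below and cite them by number.\n\n"
        f"EXCERPTS:\n{numbered}\n\nQUESTION: {question}"
    )

# Step 1: the "stored" corpus (toy data, not real statutes)
DOCS = [
    "Article 1: A pharmacist must hold a licence issued by the minister.",
    "Article 2: Mining permits are granted for thirty years.",
    "Article 3: Court rulings are published within thirty days.",
]
top = retrieve("Who must hold a licence?", DOCS, k=1)
prompt = build_prompt("Who must hold a licence?", top)
```

The production version replaces `toy_embed` with a real embedding model and the in-memory list with a vector store, but the control flow stays the same.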

Starting Point: Use the Vercel AI Chatbot Template

Do not build the chat UI from scratch. Use vercel.com/templates/next.js/chatbot as your open-source full-stack starting point.

It is minimal but complete. It handles:

Streaming AI responses out of the box
Database integration (Postgres via Drizzle or Prisma)
Auth (NextAuth)
A clean chat UI you can adapt

Your job is to wire in your retrieval layer — the rest is already done. This alone saves you two to three weeks.

Getting Free AI API Credits

You do not need to spend money to validate your idea. Start here:

Option	What you get	Notes
build.nvidia.com	Free AI API with many models	Slow and unreliable — start here to test
New GCP account	$300 credit for 90 days	Use for Gemini API. Strong multilingual performance
New AWS account	$200 credit for 1 year	Use for AWS Bedrock (Claude, Llama, etc.)
ChatGPT Pro trial	1 month free for new accounts	Unlocks Codex with high rate limits

For Japanese or Mongolian language tasks, Gemini (Google) performs notably well. For legal and structured reasoning, Claude (Anthropic via Bedrock) is strong.

Free Coding Assistants

As a solo developer building something serious, use every free tool available:

Cursor — 1 year free with a student email. The best AI code editor by a margin
Codex (OpenAI) — strong for code generation, available with ChatGPT Pro
Windsurf / Cline — both have free tiers worth using for scaffolding

Do not pay for coding tools until you have paying users.

The Data Pipeline: Where Your Moat Lives

The AI model is a commodity. Anyone can call the same API. Your data pipeline is your competitive advantage.

For huuli.tech, the data is every Mongolian law and every published court ruling. No one else has that in a clean, searchable, chunked, embedded form optimized for retrieval. That corpus is the product.

Step 1: Scraping

Write a Python script to scrape your source. For government sites, httpx + BeautifulSoup is usually enough. For paginated databases, check if there is an API first — it saves you weeks.

import httpx
from bs4 import BeautifulSoup

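def scrape_law_page(url: str) -> str:
    """Fetch one page and return the text of its main content area."""
    resp = httpx.get(url, timeout=30, follow_redirects=True)
    # Fail loudly on 4xx/5xx so error pages never enter the corpus
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Target the main content area, strip nav/footer
    # (the selectors are site-specific; adjust them to your source)
    content = soup.select_one("#main-content") or soup.select_one("article")
    return content.get_text(separator="\n", strip=True) if content else ""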

Step 2: Extraction and OCR

If your source PDFs are text-based, use pdfplumber or pymupdf. If they are scanned images (common with older government documents), you need OCR. Warning: bad OCR is the silent killer of RAG quality. If your OCR produces garbage, your embeddings are garbage, and your retrieval is garbage — but the LLM will still generate confident-sounding wrong answers. Always spot-check your extracted text.

import pdfplumber

def extract_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n\n".join(
            page.extract_text() or "" for page in pdf.pages
        )

# For scanned PDFs, use OCR:
# import pytesseract
# from pdf2image import convert_from_path
# images = convert_from_path(path)
# text = "\n\n".join(pytesseract.image_to_string(img, lang="jpn") for img in images)
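
Spot-checking scales better with a cheap automated pre-filter that tells you which pages to look at first. The sketch below is one possible heuristic, not an established metric: it flags pages with a high share of replacement, control, private-use, or unassigned characters. The function names and the 2% threshold are my own assumptions. It only catches encoding-level damage; characters that are valid but misrecognized still need a human eye.

```python
import unicodedata

def suspicious_ratio(text: str) -> float:
    """Fraction of characters that commonly indicate extraction damage."""
    if not text:
        return 1.0  # an empty page is itself a red flag
    bad = 0
    for ch in text:
        if ch in "\n\t":
            continue  # ordinary layout whitespace is fine
        category = unicodedata.category(ch)
        # U+FFFD replacement char, control, private-use, or unassigned
        if ch == "\ufffd" or category in ("Cc", "Co", "Cn"):
            bad += 1
    return bad / len(text)

def flag_pages(pages: list[str], threshold: float = 0.02) -> list[int]:
    """Return indexes of pages worth checking by hand (threshold is a guess)."""
    return [i for i, page in enumerate(pages) if suspicious_ratio(page) > threshold]

pages = [
    "第5条　薬剤師は調剤に従事する。",   # clean sample text
    "第\ufffd条 \ufffd\ufffd剤師",        # damaged sample text
    "",                                    # nothing extracted
]
flagged = flag_pages(pages)
```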

Chunking Strategy: The Most Important Decision

Bad chunking is the most common reason RAG systems fail. If your chunks are too large, retrieval is noisy. Too small, and they lose context. Wrong boundaries, and you split an article in half and retrieve half-answers.

The rule: chunk by meaning, not by character count.

General Starting Point

Parameter	Value	Why
Chunk size	300–800 tokens	Fits comfortably in retrieval context
Overlap	10–20%	Preserves context at boundaries
Split by	Section/article boundary first	Meaning over length
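
As a rough sketch of these defaults for documents without a clear article structure: split on paragraph boundaries first, then pack paragraphs into chunks up to a size budget, carrying the last paragraph forward as overlap. The function name `pack_paragraphs` is hypothetical, and the default `count_tokens=len` counts characters, which is only a crude proxy for tokens; pass your tokenizer's counting function for real use.

```python
from typing import Callable

def pack_paragraphs(
    text: str,
    max_tokens: int = 800,
    overlap_paragraphs: int = 1,
    count_tokens: Callable[[str], int] = len,  # ASSUMPTION: characters as a token proxy
) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and count_tokens(candidate) > max_tokens:
            # Budget exceeded: emit the chunk, keep trailing paragraphs as overlap
            chunks.append("\n\n".join(current))
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
chunks = pack_paragraphs(sample, max_tokens=25, overlap_paragraphs=1)
```

A single paragraph longer than the budget is kept whole here, which matches the "meaning over length" rule; whether to split such paragraphs further is a judgment call for your corpus.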

For Legal and Regulatory Documents (huuli.tech, houkaitei, roppolab)

Chunk by legal article unit (条 in Japanese). Each article is self-contained and citable. Do not split mid-article.

import re

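def chunk_japanese_law(text: str) -> list[dict]:
    """Split a Japanese law document by article (条文)."""
    # Match article headers like "第１条", "第2条", "第十条", or "第5条の2".
    # The pattern is anchored to the start of a line so that in-text
    # references such as "前項及び第5条の規定" do not trigger a split.
    # This assumes each article header begins its own line in your text.
    numeral = r'[一二三四五六七八九十百千\d]+'
    pattern = rf'^(第{numeral}条(?:の{numeral})*)'
    parts = re.split(pattern, text, flags=re.MULTILINE)
    chunks = []
    # parts = [preamble, header, body, header, body, ...]
    for i in range(1, len(parts), 2):
        article_num = parts[i]
        article_text = parts[i + 1].strip() if i + 1 < len(parts) else ""
        if article_text:
            chunks.append({
                "article": article_num,
                "text": f"{article_num}\n{article_text}",
                "char_count": len(article_text),
            })
    return chunks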

For Mongolian law (huuli.tech), the structure is similar — laws are organized by article (зүйл). The key insight is the same: preserve the legal unit.
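
The same idea can be written once and parameterized by the header pattern. The sketch below is illustrative, not the huuli.tech implementation: `chunk_by_header` is a hypothetical helper, and `MN_ARTICLE` assumes headers of the form "1 дүгээр зүйл" or "2 дугаар зүйл" at the start of a line. Check that assumption against your actual source text before relying on it.

```python
import re

# ASSUMPTION: article headers start a line and look like "1 дүгээр зүйл"
MN_ARTICLE = r"^(\d+\s+(?:дугаар|дүгээр)\s+зүйл)"

def chunk_by_header(text: str, header_pattern: str) -> list[dict]:
    """Split text into one chunk per header match; the pattern needs one capture group."""
    parts = re.split(header_pattern, text, flags=re.MULTILINE)
    chunks = []
    # parts = [preamble, header, body, header, body, ...]
    for i in range(1, len(parts), 2):
        header = parts[i]
        body = parts[i + 1].strip()
        if body:
            chunks.append({"article": header, "text": f"{header}\n{body}"})
    return chunks

# Placeholder sample text, not a real statute
SAMPLE = "1 дүгээр зүйл\nХуулийн зорилт\n2 дугаар зүйл\nХуулийн үйлчлэх хүрээ\n"
mn_chunks = chunk_by_header(SAMPLE, MN_ARTICLE)
```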

Preserve Hierarchy in Metadata

Every chunk should carry metadata. This is what powers citations.

chunk = {
    "text": "第5条　薬剤師は...",
    "metadata": {
        "law_name": "薬剤師法",
        "law_id": "yakuzaishi-ho",
        "article": "第5条",
        "chapter": "第2章",
        "source_url": "https://elaws.e-gov.go.jp/...",
        "last_updated": "2024-04-01",
    }
}
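
With that metadata in place, turning a retrieved chunk into a user-facing citation is a few lines. This is a minimal sketch; `format_citation` is a hypothetical helper and the URL in the example is a placeholder.

```python
def format_citation(metadata: dict) -> str:
    """Build a human-readable citation from chunk metadata, skipping missing fields."""
    parts = (metadata.get("law_name"), metadata.get("chapter"), metadata.get("article"))
    label = " ".join(p for p in parts if p) or "Source"
    url = metadata.get("source_url")
    return f"{label} ({url})" if url else label

citation = format_citation({
    "law_name": "薬剤師法",
    "chapter": "第2章",
    "article": "第5条",
    "source_url": "https://example.com/law",  # placeholder URL
})
```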

Embeddings and Storage

Use pgvector on PostgreSQL (Supabase gives you this for free). It is simpler to operate than a dedicated vector database when you are a solo developer — one less system to run.

# Using OpenAI embeddings — swap for a local model in regulated industries
from openai import OpenAI
import psycopg2
import json

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(input=text, model="text-embedding-3-small")
    return resp.data[0].embedding

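def store_chunk(conn, chunk: dict, embedding: list[float]):
    """Insert one chunk with its metadata and embedding."""
    with conn.cursor() as cur:
        cur.execute(
            # psycopg2 sends a Python list as a Postgres array, so cast it
            # to pgvector's vector type explicitly
            """
            INSERT INTO chunks (text, metadata, embedding)
            VALUES (%s, %s, %s::vector)
            """,
            (chunk["text"], json.dumps(chunk["metadata"]), embedding),
        )
    conn.commit()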

-- PostgreSQL schema
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    text      TEXT NOT NULL,
    metadata  JSONB,
    embedding VECTOR(1536)
);

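-- IVFFlat learns its clusters from existing rows, so create this index
-- after the initial bulk load rather than on an empty table
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);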
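
Because the pipeline reruns incrementally, it helps to skip chunks whose text has not changed, so you do not pay to embed them again. One way to do that, sketched under assumptions: fingerprint each chunk's text and compare against fingerprints you have already stored. The helper names and the idea of keeping the hash in an extra column (say, `content_hash`) are mine, not part of the schema above.

```python
import hashlib

def text_fingerprint(text: str) -> str:
    """Stable SHA-256 fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_embed(chunks: list[dict], known_hashes: set[str]) -> list[dict]:
    """Keep only chunks whose text is new or changed since the last run."""
    return [c for c in chunks if text_fingerprint(c["text"]) not in known_hashes]

# known_hashes would normally be loaded from the database (hypothetical column)
known_hashes = {text_fingerprint("unchanged article text")}
todo = chunks_to_embed(
    [{"text": "unchanged article text"}, {"text": "amended article text"}],
    known_hashes,
)
```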

The Application: Query Flow

On every user message:

def answer(question: str, conn) -> dict:
    # 1. Embed the question
    q_vec = embed(question)

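    # 2. Retrieve top 10 closest chunks (<=> is pgvector's cosine distance).
    #    psycopg2 sends the Python list as a Postgres array, so the explicit
    #    ::vector cast is required for the operator to resolve.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT text, metadata
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT 10
            """,
            (q_vec,),
        )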
        rows = cur.fetchall()

    context_blocks = [
        f"[{r[1].get('article', 'Source')} — {r[1].get('law_name', '')}]\n{r[0]}"
        for r in rows
    ]
    context = "\n\n---\n\n".join(context_blocks)

    # 3. Ask the LLM — only answer from the provided context
    prompt = f"""You are a legal/regulatory assistant.
Answer the question below using ONLY the provided excerpts.
Cite the specific article and law name for each point you make.
If the answer is not in the excerpts, say so clearly.

EXCERPTS:
{context}

QUESTION: {question}"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    return {
        "answer": resp.choices[0].message.content,
        "sources": [r[1] for r in rows],
    }

Pricing and Payments

International (Japan, US, Europe)

Stripe is the obvious choice. roppolab.jp runs on Stripe. Integration is straightforward and well-documented. Build a simple subscription model — monthly or annual.

Usage restrictions are easy to implement: track API calls or tokens consumed per user in your database and block or downgrade when they hit the limit.
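
A minimal sketch of that bookkeeping, kept in memory so it runs on its own. In production the counter would live in a database table keyed by user and billing period; the class name, the monthly period, and the numbers below are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import date

class UsageTracker:
    """Per-user monthly token quota (in-memory stand-in for a database table)."""

    def __init__(self, monthly_limit: int):
        self.monthly_limit = monthly_limit
        self._used: dict[tuple[str, str], int] = defaultdict(int)

    def try_consume(self, user_id: str, tokens: int, day: date) -> bool:
        """Record usage and return True, or return False if it would exceed the quota."""
        key = (user_id, day.strftime("%Y-%m"))  # one bucket per calendar month
        if self._used[key] + tokens > self.monthly_limit:
            return False
        self._used[key] += tokens
        return True

tracker = UsageTracker(monthly_limit=1000)
first = tracker.try_consume("user-1", 600, date(2025, 1, 10))   # within quota
second = tracker.try_consume("user-1", 500, date(2025, 1, 20))  # would exceed quota
third = tracker.try_consume("user-1", 500, date(2025, 2, 1))    # new month, fresh quota
```

Call `try_consume` before sending the request to the LLM, and show an upgrade prompt when it returns False.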

Mongolia

Stripe is not available in Mongolia. Handle payments directly via bank transfer or use local payment providers. For huuli.tech, direct bank payment works fine — Mongolian users are accustomed to this.

Simple Evaluation Loop

Do not skip this. Before launching, you need to know whether your RAG is actually good.

The setup:

Write 15–20 domain questions with known correct answers
Run your RAG system on each question
Ask another LLM to grade the response against the correct answer (1–10 with explanation)
Manually check: are the retrieved chunks actually relevant? Are citations correct?

import json

from openai import OpenAI

client = OpenAI()

EVAL_QUESTIONS = [
    {
        "question": "What is the penalty for dispensing without a prescription under Article 24?",
        "correct_answer": "A fine of not more than 300,000 yen under Article 86.",
    },
    # ... more questions
]

def grade(question: str, rag_answer: str, correct: str) -> dict:
    """Ask a second model to score the RAG answer against the reference answer."""
    prompt = f"""Grade the following AI answer on a scale of 1–10.

Question: {question}
Correct answer: {correct}
AI answer: {rag_answer}

Return JSON: {{"score": <1-10>, "explanation": "<brief reason>"}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    # The API returns a JSON string; parse it so the function really returns a dict
    return json.loads(resp.choices[0].message.content)

# `answer` and `conn` are the query function and database connection defined earlier
for item in EVAL_QUESTIONS:
    result = answer(item["question"], conn)
    grade_result = grade(item["question"], result["answer"], item["correct_answer"])
    print(f"Q: {item['question'][:60]}...")
    print(f"Score: {grade_result.get('score')} ({grade_result.get('explanation')})")
    print()

Run this eval loop every time you change your chunking strategy or prompt. It takes 10 minutes and catches regressions before users do.
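
The manual retrieval check can be partly automated if each eval question also records which article should be retrieved. The sketch below computes a simple hit rate at k. The `expected` labels are an extra field you would have to add to your eval set, and `hit_rate_at_k` is a hypothetical helper.

```python
def hit_rate_at_k(retrieved: list[list[str]], expected: list[str], k: int = 10) -> float:
    """Share of questions whose expected article appears in the top-k retrieved articles."""
    if not expected:
        return 0.0
    hits = sum(1 for got, want in zip(retrieved, expected) if want in got[:k])
    return hits / len(expected)

# One list of retrieved article labels per question, best match first (toy data)
retrieved = [["第5条", "第6条"], ["第1条"], ["第9条", "第24条"]]
expected = ["第5条", "第2条", "第24条"]
rate_at_10 = hit_rate_at_k(retrieved, expected, k=10)
rate_at_1 = hit_rate_at_k(retrieved, expected, k=1)
```

A low hit rate with good LLM grades usually means the model is answering from memory rather than from your corpus, which is exactly what you want to catch.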

Advanced RAG Patterns (Do Not Build These First)

Get the basic RAG working and validated first. Then consider these if you need to improve quality:

Reranker

Embeddings are fast but approximate. A reranker model reads the actual text of the query and each candidate chunk and scores their relevance precisely. The workflow becomes:

Embedding search → top 50 candidates → reranker → top 10 → LLM

Cohere Rerank and cross-encoder/ms-marco-MiniLM-L-6-v2 (free, local) are good options.
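
The second stage is easy to sketch with the scorer left pluggable. Here `overlap_score` is a toy word-overlap scorer standing in for a real reranker model, and both function names are hypothetical; in practice `score` would wrap whichever reranker you choose.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 10,
) -> list[str]:
    """Re-order first-stage candidates by a more precise (query, passage) score."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_k]

def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: share of query words found in the passage (not a real reranker)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

candidates = [
    "mining permits last thirty years",
    "pharmacist licence rules",
    "licence renewal for a pharmacist",
]
best = rerank("pharmacist licence renewal", candidates, overlap_score, top_k=2)
```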

Router

A cheap LLM at the gate classifies the query before retrieval. Useful because you do not want to run retrieval on "hello" or "thanks". It also lets you route to different corpora — e.g., statutes vs. case law vs. commentary.
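
To show the shape of the gate, here is a rule-based stand-in. It is deliberately not an LLM classifier: the keyword lists and route names are invented for illustration, and a real router would replace the body of `route` with a call to a small model.

```python
import re

# ASSUMPTION: illustrative small-talk phrases and route names
SMALL_TALK = {"hello", "hi", "hey", "thanks", "thank you", "ok", "okay", "bye"}

def route(message: str) -> str:
    """Return "chat" (skip retrieval), "case_law", or "statutes"."""
    normalized = re.sub(r"[^\w\s]", "", message.lower()).strip()
    if not normalized or normalized in SMALL_TALK:
        return "chat"
    if re.search(r"\b(ruling|judgment|precedent|case)\b", normalized):
        return "case_law"
    return "statutes"

routes = [route(m) for m in ("Hello!", "Is there a precedent on mining permits?",
                             "What does Article 5 require?")]
```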

Web Search

For queries about recent events not in your corpus, a simple web search fallback (Serper API, Tavily) prevents the "I don't have that information" dead-end.

AI Beyond RAG

RAG is the starting point — it validates that there is demand and that users trust your system's answers. Once you have that, more advanced patterns open up.

AI Document Reviewer

Instead of answering questions, the AI reads a document the user uploads, adds comments like a senior colleague would, and highlights issues. Think Google Docs comments, but generated by AI grounded in your domain corpus.

This is one of the best uses of current AI capability — it is genuinely useful and reliably correct when grounded in domain knowledge. Both huuli.tech and roppolab.jp have implemented this for contract review and exam answer review respectively.

Multi-Agent Deep Researcher

For complex questions requiring synthesis across many documents, a multi-agent system takes multiple passes — retrieving, reasoning, identifying gaps, retrieving again — until it produces a comprehensive research summary.

This is more complex to build and harder to evaluate, but it is genuinely powerful for professional research workflows.
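
The core control flow can be sketched as a single loop, which is a simplification of a full multi-agent system rather than an implementation of one. Both `retrieve` and `find_gaps` are hypothetical callables you would supply: the first wraps your retrieval layer, the second asks an LLM which follow-up queries are still needed and returns an empty list when the evidence is sufficient.

```python
from typing import Callable

def deep_research(
    question: str,
    retrieve: Callable[[str], list[str]],
    find_gaps: Callable[[str, list[str]], list[str]],
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve, look for gaps, and retrieve again until no gaps remain."""
    evidence: list[str] = []
    queries = [question]
    for _ in range(max_rounds):  # hard cap so the loop always terminates
        for q in queries:
            for passage in retrieve(q):
                if passage not in evidence:
                    evidence.append(passage)
        queries = find_gaps(question, evidence)
        if not queries:
            break
    return evidence

# Stub components so the sketch runs without a database or an LLM
corpus = {"main question": ["passage A"], "follow-up": ["passage B"]}
found = deep_research(
    "main question",
    retrieve=lambda q: corpus.get(q, []),
    find_gaps=lambda question, ev: [] if "passage B" in ev else ["follow-up"],
)
```

The collected evidence then goes into a final synthesis prompt, the same way the basic query flow passes excerpts to the LLM.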

AI Contract Drafting

AI can draft contracts and documents. In my opinion, the current models are not reliable enough for this in professional settings without heavy human review. The risk of subtle errors in legal documents is too high. Focus on review and Q&A before you build drafting tools.

Analytics: Track What Matters

Use PostHog (free tier is generous). Instrument two things immediately:

What questions users are asking (tells you what content to improve)
Thumbs up/down on responses (tells you where the RAG is failing)

That feedback loop is how you iterate from "good enough" to "genuinely better than anything else."

Marketing and Distribution

Building is 20% of the work. The rest is getting people to use it.

SEO — Let Your AI Write About Itself

For both huuli.tech and roppolab, we use the AI to generate articles on important topics in the domain — one article per day, automatically, based on what questions users are asking.

This is the flywheel: users ask questions → you identify common topics → AI writes articles → articles rank on Google → more users find the product → more questions → repeat.

The content is accurate because the AI is grounded in your corpus. It naturally cites relevant laws and regulations. Google rewards this.

Social Media Automation

For distribution in communities where your target users are active:

Facebook groups — post in relevant professional groups, send friend requests, start conversations. Legal and medical professional communities in Mongolia and Japan are highly active on Facebook
Twitter/X automation — automated posting of domain insights. See @roppo_lab as an example — it posts automatically about Japanese law topics

The content for these posts can be generated by the same system that writes your articles. One pipeline, multiple distribution channels.

Summary: The Full Picture

RAG is the fastest path from "I know this domain deeply" to "I have a product that professionals will pay for." The AI model is a commodity. Your domain corpus, your chunking strategy, and your community trust are not.

Start narrow. Validate early. Use every free resource available. Ship before it is perfect.

The three products linked in this article are live and serving real users. If you are building something similar and want to talk through your approach — reach out to the SoduraAI team.