How to Build a RAG App for a Niche Market — Lessons from Three Production Systems
RAG is the simplest serious AI application you can build — and for niche markets, your focused system can outperform ChatGPT. Here's what I learned building three production RAG systems for Japanese pharmacists, Mongolian lawyers, and Japanese bar exam students.
I have built multiple RAG systems in production across different markets and languages. Before going into technical details, let me show you what I mean by "niche market RAG" and why it is worth building.
- houkaitei.gertech.jp: AI assistant for Japanese pharmacists, grounded in Japanese government pharmaceutical regulations and the complete national drug database
- huuli.tech: AI legal assistant for Mongolian lawyers, grounded in all Mongolian laws and court rulings
- roppolab.jp: AI study assistant for Japanese bar exam students, covering all Japanese laws, court rulings, and academic commentary
These are not demos. They are paid products used by real professionals every day.
This article is everything I wish someone had told me before I built the first one.
The Core Insight: ChatGPT Is Your Benchmark, Not Your Competition
ChatGPT's free tier is genuinely impressive. As a solo developer, competing on breadth is a losing game. You will never out-general OpenAI.
But for a Mongolian mining lawyer, or a Japanese hospital pharmacist, ChatGPT is mediocre. It does not know the specific regulation. It cannot cite the exact court ruling. It confabulates when pushed on narrow domain specifics.
Your advantage is precisely that you are narrow.
A RAG system built on the complete corpus of Mongolian law returns more reliable answers than ChatGPT for Mongolian legal questions, because every answer is tied to retrieved text. It cites the actual law. The user can click the citation and verify. That trust is worth paying for.
This is the business case: better than ChatGPT for one specific thing, with citations, in the language of the domain.
What Is RAG?
RAG stands for Retrieval-Augmented Generation. Instead of asking an LLM to answer from memory, you:
- Store your domain documents as searchable vectors
- At query time, find the most relevant passages
- Give those passages to the LLM as context
- Ask the LLM to answer only from what you gave it, with citations
That is the whole system. The data pipeline runs once (and then incrementally). The application runs on every query.
Starting Point: Use the Vercel AI Chatbot Template
Do not build the chat UI from scratch. Use vercel.com/templates/next.js/chatbot as your open-source full-stack starting point.
It is minimal but complete. It handles:
- Streaming AI responses out of the box
- Database integration (Postgres via Drizzle or Prisma)
- Auth (NextAuth)
- A clean chat UI you can adapt
Your job is to wire in your retrieval layer — the rest is already done. This alone saves you two to three weeks.
Getting Free AI API Credits
You do not need to spend money to validate your idea. Start here:
| Option | What you get | Notes |
|---|---|---|
| build.nvidia.com | Free AI API with many models | Slow and unreliable — start here to test |
| New GCP account | $300 credit for 90 days | Use for Gemini API. Strong multilingual performance |
| New AWS account | $200 credit for 1 year | Use for AWS Bedrock (Claude, Llama, etc.) |
| ChatGPT Pro trial | 1 month free for new accounts | Unlocks Codex with high rate limits |
For Japanese or Mongolian language tasks, Gemini (Google) performs notably well. For legal and structured reasoning, Claude (Anthropic via Bedrock) is strong.
Free Coding Assistants
As a solo developer building something serious, use every free tool available:
- Cursor — 1 year free with a student email. The best AI code editor by a margin
- Codex (OpenAI) — strong for code generation, available with ChatGPT Pro
- Windsurf / Cline — both have free tiers worth using for scaffolding
Do not pay for coding tools until you have paying users.
The Data Pipeline: Where Your Moat Lives
The AI model is a commodity. Anyone can call the same API. Your data pipeline is your competitive advantage.
For huuli.tech, the data is every Mongolian law and every published court ruling. No one else has that in a clean, searchable, chunked, embedded form optimized for retrieval. That corpus is the product.
Step 1: Scraping
Write a Python script to scrape your source. For government sites, httpx + BeautifulSoup is usually enough. For paginated databases, check if there is an API first — it saves you weeks.
```python
import httpx
from bs4 import BeautifulSoup


def scrape_law_page(url: str) -> str:
    """Fetch one page and return the text of its main content area."""
    resp = httpx.get(url, timeout=30)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    # Target the main content area, strip nav/footer
    content = soup.select_one("#main-content") or soup.select_one("article")
    return content.get_text(separator="\n", strip=True) if content else ""
```
Step 2: Extraction and OCR
If your source PDFs are text-based, use pdfplumber or pymupdf. If they are scanned images (common with older government documents), you need OCR.
Warning: bad OCR is the silent killer of RAG quality. If your OCR produces garbage, your embeddings are garbage, and your retrieval is garbage — but the LLM will still generate confident-sounding wrong answers. Always spot-check your extracted text.
```python
import pdfplumber


def extract_pdf(path: str) -> str:
    with pdfplumber.open(path) as pdf:
        return "\n\n".join(
            page.extract_text() or "" for page in pdf.pages
        )

# For scanned PDFs, use OCR:
# import pytesseract
# from pdf2image import convert_from_path
# images = convert_from_path(path)
# text = "\n\n".join(pytesseract.image_to_string(img, lang="jpn") for img in images)
```
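The spot-check can be made partly automatic. The sketch below is one possible heuristic, not a standard method: it flags pages where too few characters are Japanese script and picks a random sample of pages to read by hand. The function names and the 0.6 threshold are assumptions to tune on your own corpus; pages full of tables or numbers will need a lower threshold.

```python
import random
import re

# Hiragana, katakana, and the common CJK ideograph block.
JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese script."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    hits = sum(1 for c in visible if JAPANESE_CHARS.match(c))
    return hits / len(visible)


def flag_suspect_pages(pages: list[str], min_ratio: float = 0.6) -> list[int]:
    """Return indexes of pages whose text looks like OCR noise (or is empty)."""
    return [i for i, page in enumerate(pages) if japanese_ratio(page) < min_ratio]


def sample_for_review(pages: list[str], k: int = 5, seed: int = 0) -> list[int]:
    """Pick a reproducible random sample of page indexes to read by hand."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(pages)), min(k, len(pages))))
```

Flagged pages are candidates for re-running OCR with different settings. The random sample still matters, because plausible-looking Japanese text can be wrong in ways a character ratio will not catch.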
Chunking Strategy: The Most Important Decision
Bad chunking is the most common reason RAG systems fail. If your chunks are too large, retrieval is noisy. Too small, and they lose context. Wrong boundaries, and you split an article in half and retrieve half-answers.
The rule: chunk by meaning, not by character count.
General Starting Point
| Parameter | Value | Why |
|---|---|---|
| Chunk size | 300–800 tokens | Fits comfortably in retrieval context |
| Overlap | 10–20% | Preserves context at boundaries |
| Split by | Section/article boundary first | Meaning over length |
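For documents without a clear article structure, the table above can be sketched as a paragraph packer with overlap. This is a minimal illustration under stated assumptions: `chunk_paragraphs` is a hypothetical helper, paragraphs are separated by blank lines, and length is measured in characters by default. Pass a real token counter as `length` if you want to budget in tokens.

```python
from typing import Callable


def chunk_paragraphs(
    text: str,
    max_len: int = 800,
    overlap: int = 120,
    length: Callable[[str], int] = len,
) -> list[str]:
    """Pack whole paragraphs into chunks that aim for max_len units each.

    Trailing paragraphs of a finished chunk are repeated at the start of the
    next one, as long as they fit inside the overlap budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        if current and length("\n\n".join(current + [para])) > max_len:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward until the overlap budget is used.
            carried: list[str] = []
            for prev in reversed(current):
                if length("\n\n".join([prev] + carried)) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because it never cuts inside a paragraph, a single paragraph longer than `max_len` comes out as one oversized chunk. That is a deliberate trade-off in this sketch: boundaries stay meaningful at the cost of a strict size limit.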
For Legal and Regulatory Documents (huuli.tech, houkaitei, roppolab)
Chunk by legal article unit (条 in Japanese). Each article is self-contained and citable. Do not split mid-article.
```python
import re


def chunk_japanese_law(text: str) -> list[dict]:
    """Split a Japanese law document by article (条文)."""
    # Match article headers like "第1条", "第十条", or "第5条の2" at the start of a line,
    # so in-text references such as "第5条の規定により" do not split an article in half.
    pattern = r'^[ \t\u3000]*(第[一二三四五六七八九十百千\d]+条(?:の[一二三四五六七八九十百千\d]+)*)'
    parts = re.split(pattern, text, flags=re.MULTILINE)

    chunks = []
    # parts = [preamble, header1, body1, header2, body2, ...]
    for i in range(1, len(parts), 2):
        article_num = parts[i]
        article_text = parts[i + 1].strip() if i + 1 < len(parts) else ""
        if article_text:
            chunks.append({
                "article": article_num,
                "text": f"{article_num}\n{article_text}",
                "char_count": len(article_text),
            })
    return chunks
```
For Mongolian law (huuli.tech), the structure is similar — laws are organized by article (зүйл). The key insight is the same: preserve the legal unit.
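A comparable splitter for Mongolian text might look like the sketch below. It is not the huuli.tech implementation. It assumes article headers have the shape "1 дүгээр зүйл." or "2 дугаар зүйл." at the start of a line, which you should verify against your own source before relying on it. `chunk_mongolian_law` is a hypothetical name.

```python
import re

# Assumed header shape: "<number> дүгээр зүйл" or "<number> дугаар зүйл" at line start.
# The trailing \b keeps inflected in-text references such as "зүйлд" from matching.
ARTICLE_HEADER = re.compile(
    r"^[ \t]*(\d+\s*(?:дүгээр|дугаар)\s+зүйл)\b", re.MULTILINE
)


def chunk_mongolian_law(text: str) -> list[dict]:
    """Split a Mongolian law document into one chunk per article (зүйл)."""
    # With one capturing group, split() returns
    # [preamble, header1, body1, header2, body2, ...].
    parts = ARTICLE_HEADER.split(text)
    chunks = []
    for i in range(1, len(parts), 2):
        header = parts[i]
        body = parts[i + 1].strip()
        if body:
            chunks.append({"article": header, "text": f"{header}{body}"})
    return chunks
```

Anything before the first header (the law's title and preamble) is dropped here. Keep it in metadata if you need it for citations.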
Preserve Hierarchy in Metadata
Every chunk should carry metadata. This is what powers citations.
```python
chunk = {
    "text": "第5条 薬剤師は...",
    "metadata": {
        "law_name": "薬剤師法",
        "law_id": "yakuzaishi-ho",
        "article": "第5条",
        "chapter": "第2章",
        "source_url": "https://elaws.e-gov.go.jp/...",
        "last_updated": "2024-04-01",
    }
}
```
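To show how this metadata turns into something a user can click, here is a small sketch that builds citation lines from it. The helper names are hypothetical and the output format is only one reasonable choice. The field names match the metadata example above.

```python
def format_citation(metadata: dict) -> str:
    """Build a human-readable citation line from chunk metadata."""
    parts = [
        metadata.get("law_name"),
        metadata.get("chapter"),
        metadata.get("article"),
    ]
    label = " ".join(p for p in parts if p)
    url = metadata.get("source_url")
    return f"{label} ({url})" if url else label


def unique_citations(sources: list[dict]) -> list[str]:
    """Format citations for retrieved chunks, dropping duplicates but keeping order."""
    seen: set[str] = set()
    result = []
    for metadata in sources:
        citation = format_citation(metadata)
        if citation and citation not in seen:
            seen.add(citation)
            result.append(citation)
    return result
```

Missing fields are skipped and no placeholder is printed, so a chunk with only a law name and article still yields a usable citation.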
Embeddings and Storage
Use pgvector on PostgreSQL (Supabase gives you this for free). It is simpler to operate than a dedicated vector database when you are a solo developer — one less system to run.
```python
# Using OpenAI embeddings; swap for a local model in regulated industries
import json

import psycopg2
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> list[float]:
    resp = client.embeddings.create(input=text, model="text-embedding-3-small")
    return resp.data[0].embedding


def store_chunk(conn, chunk: dict, embedding: list[float]):
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chunks (text, metadata, embedding)
            VALUES (%s, %s, %s::vector)
            """,
            # pgvector parses the "[0.1, 0.2, ...]" text form of a vector
            (chunk["text"], json.dumps(chunk["metadata"]), str(embedding)),
        )
    conn.commit()
```
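```sql
-- PostgreSQL schema
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    text TEXT NOT NULL,
    metadata JSONB,
    embedding VECTOR(1536)  -- text-embedding-3-small returns 1536 dimensions
);

-- Create the IVFFlat index after loading data; its clusters are built from existing rows.
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```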
The Application: Query Flow
On every user message:
```python
def answer(question: str, conn) -> dict:
    # 1. Embed the question
    q_vec = embed(question)

    # 2. Retrieve top 10 closest chunks (<=> is pgvector's cosine distance)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT text, metadata
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT 10
            """,
            # Pass the vector as text and cast it; a bare Python list is sent
            # as a Postgres array, which the <=> operator does not accept.
            (str(q_vec),),
        )
        rows = cur.fetchall()

    context_blocks = [
        f"[{r[1].get('article', 'Source')}, {r[1].get('law_name', '')}]\n{r[0]}"
        for r in rows
    ]
    context = "\n\n---\n\n".join(context_blocks)

    # 3. Ask the LLM: only answer from the provided context
    prompt = f"""You are a legal/regulatory assistant.
Answer the question below using ONLY the provided excerpts.
Cite the specific article and law name for each point you make.
If the answer is not in the excerpts, say so clearly.
EXCERPTS:
{context}
QUESTION: {question}"""

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return {
        "answer": resp.choices[0].message.content,
        "sources": [r[1] for r in rows],
    }
```
Pricing and Payments
International (Japan, US, Europe)
Stripe is the obvious choice. roppolab.jp runs on Stripe. Integration is straightforward and well-documented. Build a simple subscription model — monthly or annual.
Usage restrictions are easy to implement: track API calls or tokens consumed per user in your database and block or downgrade when they hit the limit.
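As a rough illustration of that per-user limit, the sketch below counts queries per user per calendar month and refuses once the limit is reached. It keeps the counts in memory only to stay self-contained. `UsageTracker` and its fields are hypothetical, and in a real app the counter would live in a database table keyed the same way.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class UsageTracker:
    """In-memory stand-in for a per-user, per-month usage table."""

    monthly_limit: int = 100
    counts: dict[tuple[str, str], int] = field(default_factory=dict)

    def _key(self, user_id: str, today: date) -> tuple[str, str]:
        # One counter per user per calendar month, e.g. ("u1", "2024-04").
        return (user_id, today.strftime("%Y-%m"))

    def allow(self, user_id: str, today: date) -> bool:
        """Record one query; return False once the monthly limit is reached."""
        key = self._key(user_id, today)
        used = self.counts.get(key, 0)
        if used >= self.monthly_limit:
            return False
        self.counts[key] = used + 1
        return True

    def remaining(self, user_id: str, today: date) -> int:
        """Queries the user can still make this month."""
        return self.monthly_limit - self.counts.get(self._key(user_id, today), 0)
```

Call `allow()` before running retrieval, and show `remaining()` in the UI so the limit does not come as a surprise. Counting tokens in place of calls only changes what gets added to the counter.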
Mongolia
Stripe is not available in Mongolia. Handle payments directly via bank transfer or use local payment providers. For huuli.tech, direct bank payment works fine — Mongolian users are accustomed to this.
Simple Evaluation Loop
Do not skip this. Before launching, you need to know whether your RAG is actually good.
The setup:
- Write 15–20 domain questions with known correct answers
- Run your RAG system on each question
- Ask another LLM to grade the response against the correct answer (1–10 with explanation)
- Manually check: are the retrieved chunks actually relevant? Are citations correct?
```python
import json

from openai import OpenAI

client = OpenAI()

EVAL_QUESTIONS = [
    {
        "question": "What is the penalty for dispensing without a prescription under Article 24?",
        "correct_answer": "A fine of not more than 300,000 yen under Article 86.",
    },
    # ... more questions
]


def grade(question: str, rag_answer: str, correct: str) -> dict:
    prompt = f"""Grade the following AI answer on a scale of 1–10.
Question: {question}
Correct answer: {correct}
AI answer: {rag_answer}
Return JSON: {{"score": <1-10>, "explanation": "<brief reason>"}}"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # The API returns a JSON string; parse it so callers get the dict they expect.
    return json.loads(resp.choices[0].message.content)


# Assumes answer() from the query flow and an open database connection `conn`.
for item in EVAL_QUESTIONS:
    result = answer(item["question"], conn)
    grade_result = grade(item["question"], result["answer"], item["correct_answer"])
    print(f"Q: {item['question'][:60]}...")
    print(f"Score: {grade_result}")
    print()
```
Run this eval loop every time you change your chunking strategy or prompt. It takes 10 minutes and catches regressions before users do.
Advanced RAG Patterns (Do Not Build These First)
Get the basic RAG working and validated first. Then consider these if you need to improve quality:
Reranker
Embeddings are fast but approximate. A reranker model reads the actual text of the query and each candidate chunk and scores their relevance precisely. The workflow becomes:
Embedding search → top 50 candidates → reranker → top 10 → LLM
Cohere Rerank and cross-encoder/ms-marco-MiniLM-L-6-v2 (free, local) are good options.
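The retrieve-then-rerank step can be sketched independently of any particular model. In the example below, `rerank` takes a scoring function as a parameter, and `overlap_score` is a deliberately crude word-overlap stand-in so the code runs without downloads. Both names are hypothetical. With a real cross-encoder you would pass a function that scores each (query, chunk) pair with the model.

```python
from typing import Callable

# A scorer takes (query, chunk_text) and returns a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def rerank(query: str, candidates: list[str], score: Scorer, top_k: int = 10) -> list[str]:
    """Re-order embedding-search candidates by a more precise relevance score."""
    scored = [(score(query, text), text) for text in candidates]
    # sort() is stable, so ties keep their original embedding-search order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer for illustration: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0
```

The point of the design is that retrieval asks for more candidates than the LLM will see (50 in the flow above) and the reranker decides which 10 survive.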
Router
A cheap LLM at the gate classifies the query before retrieval. Useful because you do not want to run retrieval on "hello" or "thanks". It also lets you route to different corpora — e.g., statutes vs. case law vs. commentary.
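A minimal sketch of such a gate is shown below. The classifier is injected, so `keyword_classifier` is only a toy stand-in for the cheap LLM call. The corpus labels and keyword lists are assumptions for illustration.

```python
import re
from typing import Callable, Optional

CORPORA = ("statutes", "case_law", "commentary")
SMALLTALK = {"hello", "hi", "hey", "thanks", "thank", "you"}


def keyword_classifier(query: str) -> str:
    """Toy stand-in for the cheap LLM: returns a label for the query."""
    words = set(re.findall(r"\w+", query.lower()))
    if words and words <= SMALLTALK:
        return "smalltalk"
    if words & {"ruling", "rulings", "court", "precedent", "judgment"}:
        return "case_law"
    if words & {"commentary", "scholar", "scholars", "theory"}:
        return "commentary"
    return "statutes"


def route(query: str, classify: Callable[[str], str] = keyword_classifier) -> Optional[str]:
    """Return the corpus to search, or None when retrieval should be skipped."""
    label = classify(query).strip().lower()
    if label == "smalltalk":
        return None
    # A model can return anything, so unknown labels fall back to the default corpus.
    return label if label in CORPORA else "statutes"
```

When `route()` returns `None`, reply directly without retrieval. Otherwise use the label to pick which table or metadata filter the search runs against.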
Web Search
For queries about recent events not in your corpus, a simple web search fallback (Serper API, Tavily) prevents the "I don't have that information" dead-end.
AI Beyond RAG
RAG is the starting point — it validates that there is demand and that users trust your system's answers. Once you have that, more advanced patterns open up.
AI Document Reviewer
Instead of answering questions, the AI reads a document the user uploads, adds comments like a senior colleague would, and highlights issues. Think Google Docs comments, but generated by an AI grounded in your domain corpus.
This is one of the best uses of current AI capability — it is genuinely useful and reliably correct when grounded in domain knowledge. Both huuli.tech and roppolab.jp have implemented this for contract review and exam answer review respectively.
Multi-Agent Deep Researcher
For complex questions requiring synthesis across many documents, a multi-agent system takes multiple passes — retrieving, reasoning, identifying gaps, retrieving again — until it produces a comprehensive research summary.
This is more complex to build and harder to evaluate, but it is genuinely powerful for professional research workflows.
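The control flow of that retrieve, reason, find gaps, retrieve again cycle can be sketched as a single loop. This is a simplified illustration and not a multi-agent framework: `retrieve`, `find_gaps`, and `summarize` are hypothetical callables you would back with your search layer and LLM calls.

```python
from typing import Callable


def deep_research(
    question: str,
    retrieve: Callable[[str], list[str]],
    find_gaps: Callable[[str, list[str]], list[str]],
    summarize: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Retrieve, look for gaps, retrieve again, then summarize the collected notes."""
    notes: list[str] = []
    seen_queries: set[str] = set()
    pending = [question]

    for _ in range(max_rounds):
        if not pending:
            break  # no open sub-questions left
        for query in pending:
            seen_queries.add(query)
            for passage in retrieve(query):
                if passage not in notes:
                    notes.append(passage)
        # Ask which follow-up questions the notes still leave unanswered,
        # skipping any that were already searched to avoid loops.
        pending = [q for q in find_gaps(question, notes) if q not in seen_queries]

    return summarize(question, notes)
```

The `max_rounds` cap and the `seen_queries` set are what keep cost bounded. Without them a gap-finder that keeps proposing the same follow-up would never terminate.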
AI Contract Drafting
AI can draft contracts and documents. In my opinion, the current models are not reliable enough for this in professional settings without heavy human review. The risk of subtle errors in legal documents is too high. Focus on review and Q&A before you build drafting tools.
Analytics: Track What Matters
Use PostHog (free tier is generous). Instrument two things immediately:
- What questions users are asking (tells you what content to improve)
- Thumbs up/down on responses (tells you where the RAG is failing)
That feedback loop is how you iterate from "good enough" to "genuinely better than anything else."
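Once feedback events are exported, a few lines are enough to see where the system is failing. The sketch below assumes each event is a dict with a `topic` and a `rating` of `"up"` or `"down"`. That shape is hypothetical, so adapt it to whatever your analytics export looks like.

```python
from collections import defaultdict


def thumbs_down_rate(events: list[dict]) -> dict[str, float]:
    """Share of thumbs-down votes per topic."""
    totals: dict[str, int] = defaultdict(int)
    downs: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["topic"]] += 1
        if event["rating"] == "down":
            downs[event["topic"]] += 1
    return {topic: downs[topic] / totals[topic] for topic in totals}


def worst_topics(events: list[dict], min_votes: int = 5, limit: int = 3) -> list[str]:
    """Topics with the highest thumbs-down rate, ignoring topics with too few votes."""
    votes: dict[str, int] = defaultdict(int)
    for event in events:
        votes[event["topic"]] += 1
    rates = thumbs_down_rate(events)
    eligible = [topic for topic in rates if votes[topic] >= min_votes]
    return sorted(eligible, key=lambda topic: rates[topic], reverse=True)[:limit]
```

The `min_votes` floor keeps one angry click on a rare topic from dominating the list. The topics that surface are where to check chunking and corpus coverage first.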
Marketing and Distribution
Building is 20% of the work. The rest is getting people to use it.
SEO — Let Your AI Write About Itself
For both huuli.tech and roppolab, we use the AI to generate articles on important topics in the domain — one article per day, automatically, based on what questions users are asking.
This is the flywheel: users ask questions → you identify common topics → AI writes articles → articles rank on Google → more users find the product → more questions → repeat.
The content is accurate because the AI is grounded in your corpus. It naturally cites relevant laws and regulations. Google rewards this.
Social Media Automation
For distribution in communities where your target users are active:
- Facebook groups — post in relevant professional groups, send friend requests, start conversations. Legal and medical professional communities in Mongolia and Japan are highly active on Facebook
- Twitter/X automation — automated posting of domain insights. See @roppo_lab as an example — it posts automatically about Japanese law topics
The content for these posts can be generated by the same system that writes your articles. One pipeline, multiple distribution channels.
Summary: The Full Picture
RAG is the fastest path from "I know this domain deeply" to "I have a product that professionals will pay for." The AI model is a commodity. Your domain corpus, your chunking strategy, and your community trust are not.
Start narrow. Validate early. Use every free resource available. Ship before it is perfect.
The three products linked in this article are live and serving real users. If you are building something similar and want to talk through your approach — reach out to the SoduraAI team.