From MiniSearch to RAG - Blog Search Enhancement

February 14, 2026, 10:00 AM

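A “requirements document” usually looks like a feature specification, but in practice it is closer to a record of the failures encountered while running the system.

This post covers why the RAG system now running on my Astro blog needed its requirements, and what limitations I hit with the starting point: MiniSearch plus LLM prompt injection. It also walks through, with code, the MVP (Task 1~7) implementation that turned the requirements/design/tasks documents into working code.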

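Starting point: why the MiniSearch + Gemini combination was attractive

The initial structure was simple.

Index the static blog posts with MiniSearch.
When a user asks a question, find a few documents by keyword.
Prepend part of the matched body text to the LLM prompt and generate an answer.

This approach is fast to implement and costs almost nothing in infrastructure. In a setting with few documents, such as a personal blog, it often produces results that look “good enough.” But problems appeared once user questions got longer, phrasing became more varied, and questions started to require code context.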

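1) The limits of a structure that only finds what the keywords match

MiniSearch is fundamentally keyword matching. If a user asks about the “파이버 비교 단계” (Korean for “fiber comparison phase”) instead of the “reconcile phase,” the relevant post is missed. In other words, it was weak against mixed English/Korean phrasing, abbreviations, and contextual synonyms, and it depended heavily on the surface string rather than the meaning of the question.

2) Unstable quality of the injected prompt context

When MiniSearch results are pasted into the prompt as-is, paragraphs unrelated to the question come along and make it harder for the LLM to judge. Code-block and link context also gets broken, which makes evidence citations inaccurate. So “search → inject” was happening, but the density and consistency of the injected context were not guaranteed.

3) Difficulty keeping source attribution and the UI contract

The existing UI (LLMSearchModal) expects a streaming response and a sources format. Early on, search results and answers were only loosely connected, so documents that were never actually used showed up as sources, and documents reflected in the answer were sometimes missing. As a result, users could not easily trust whether an answer was really based on the blog posts.

4) Lack of reproducibility and observability in operations

Root-cause analysis was also hard when something went wrong. It was difficult to tell which documents were attached for which query, why a document was dropped by the similarity criterion, and why a fallback happened during a failure.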

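Key implementation decisions finalized in the design document

If the requirements document explained “why it is needed,” the design document spelled out “how to implement it.” Below are the key decisions made in the design and the background behind them.

Note: What follows is the target architecture at design time. The MVP (Task 1~7) was implemented first with an InMemoryVectorStore + prebuilt index strategy; Upstash Vector, Hybrid Search (RRF), caching, and incremental indexing are planned for later stages. Each item notes the current MVP status separately.

1) Architecture: Hybrid Search rather than removing MiniSearch

The core of the design was to combine, not to replace.

MiniSearch is still strong at fast keyword matching. When a user enters an exact term it returns results immediately, and there was no reason to give up that speed. Instead, the goal was to cover the semantic-search area that MiniSearch misses with vector search, and to merge the two result sets with RRF (Reciprocal Rank Fusion) after threshold filtering. The direction is to improve recall and precision together without giving up the existing strengths.

MVP status: Currently the semantic search (InMemoryVectorStore) path runs alone, and MiniSearch is used only as a fallback when RAG fails. Running keyword + semantic search in parallel and merging with RRF is planned for Task 8.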

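2) Writing down the reasons for the technology stack

The design recorded not only which stack was chosen but also “why this one, and why not the others.”

The Vercel AI SDK was already in use in the existing streaming pipeline, so a new pipeline fits its interfaces naturally. Upstash Vector was set as the target vector store because it suits serverless environments and has a free tier, so it adds no cost burden at personal-blog scale. The embedding model was fixed to gemini-embedding-001 (768-dim) to balance speed and cost.

LangChain, on the other hand, was excluded because of its complexity and bundle-size overhead.

MVP status: The vector store was implemented first with InMemoryVectorStore + a prebuilt index (rag-index.json) instead of Upstash. At blog scale the in-memory approach works well enough, and the plan is to move to Upstash as the number of documents grows.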

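3) The most important decision: separating the indexing pipeline

Deployment stability was secured by separating astro build from index synchronization. Indexing runs independently through a separate sync-rag-index script and can be run manually, on a schedule, or in CI. The reason is clear: network instability or an embedding API outage must not block the web deployment itself.

Script-level separation alone was not enough

Separating build and sync-rag-index into distinct scripts in package.json was the first step.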

"build": "astro check && astro build && jampack ./dist"
"sync-rag-index": "npx tsx scripts/sync-rag-index.ts"

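However, CI (deploy.yml) ran sync-rag-index → vercel build → vercel deploy sequentially inside the same job, so an embedding API outage still blocked build and deployment. The lesson was that even when scripts are separated, failures are not isolated if the execution flow is still coupled.

Separation at the GitHub Workflows level

To solve this, the single workflow was split into two: an indexing workflow (rag-index.yml) and a deploy workflow (deploy.yml).

	rag-index.yml (indexing only)	deploy.yml (build/deploy only)
Trigger ①	push (content paths only)	push (content paths excluded)
Trigger ②	workflow_dispatch (manual)	workflow_run (after indexing completes)

The indexing workflow is triggered only when content (src/content/blog/**, src/content/rag/**) changes. The generated rag-index.json is uploaded as a GitHub Actions artifact, and the deploy workflow uses dawidd6/action-download-artifact to download the artifact from the most recent successful indexing run.

Preventing double deploys: paths-ignore + workflow_run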

# deploy.yml
on:
  push:
    branches: [master]
    paths-ignore:
      - "src/content/blog/**"
      - "src/content/rag/**"
      - "scripts/sync-rag-index.ts"
  workflow_run:
    workflows: ["RAG Index"]
    types: [completed]
    branches: [master]

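On a content-only push, deploy.yml is not triggered directly; deployment happens only through workflow_run after rag-index.yml completes. When code and content change together, both sides are triggered, but concurrency: { group: deploy, cancel-in-progress: true } ensures that only the final run completes.

Thanks to this structure, each scenario behaves as follows.

Scenario	rag-index.yml	deploy.yml
Code-only change	Not triggered	push trigger → reuse previous artifact → deploy
Content-only change	Generate embeddings → upload artifact	push not triggered → workflow_run deploy (fresh)
Code + content change	Generate embeddings → upload artifact	push-triggered deploy → workflow_run redeploy (concurrency lets only one complete)
Embedding API outage	Fails	Deploys normally with previous artifact (not blocked)
First deploy (bootstrap)	—	No artifact → inline fallback generation

When code and content are in the same push, deploy.yml first runs from the push trigger and is then re-run via workflow_run, which cancels the first run. This is log noise rather than a functional problem, and the final deploy always completes with a fresh index.

Why the push trigger on deploy.yml was not removed

A simpler alternative would be a structure where “every push runs only rag-index.yml, and deploy.yml is triggered only by workflow_run.” That would run cleanly, without cancellations, even when code and content change together. But this structure has a critical problem: every deployment becomes dependent on rag-index.yml succeeding, so an embedding API outage would block deployment of code changes too. Since the core purpose of splitting the workflows was “isolating embedding failures from deployment,” the push trigger on deploy.yml has to stay.

The key point is that deployment is not blocked even if the embedding API is down, and a content-only change is never deployed with a stale index. If script separation was “independence of build commands,” workflow separation also achieved “independence of execution flow.”

4) Separate rules for handling code blocks

On a development blog, code is important evidence. Even for the same post, “finding it quickly by keyword” and “giving the LLM accurate evidence” call for different forms of data, so code blocks are handled differently depending on the search purpose when the same post is indexed.

In search-index.json.ts, which generates the MiniSearch index, the stripMarkdown() function removes code blocks before indexing, because code is close to noise for keyword matching.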

// src/pages/search-index.json.ts
function stripMarkdown(md: string) {
  return md
    .replace(/```[\s\S]*?```/g, " ") // remove multi-line code blocks
    .replace(/`[^`]*`/g, " "); // remove inline code
  // ...
}
const content = stripMarkdown(post.body ?? "");

In RAG embedding, by contrast, document-loader.ts uses post.body as-is with no processing. The LLM has to generate answers using code as evidence, so the original must be preserved.

// src/lib/rag/document-loader.ts
return {
  // ...
  content: post.body ?? "", // original as-is, code blocks included
};

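In short, even for the same post, each pipeline indexes a different form of the data.

	MiniSearch	RAG
Processing	`stripMarkdown(post.body)`	`post.body` (original)
Code blocks	Removed	Preserved
Reason	Code is noise in keyword search	Code is needed as evidence for LLM answers
Output	`search-index.json`	`rag-index.json`

5) Score normalization and quality filtering

RRF is rank-based, so low-quality results mixed into the input can easily contaminate the whole ranking. For that reason the order is enforced so that a threshold filter always runs before merging. Semantic results must score at least 0.6 and keyword results at least 0.5 to pass; only the results that pass are merged with RRF, and then the top K are returned. If this order is not kept, documents with low similarity can rise to the top on rank alone, so the filter → merge → return order was fixed at design time.

MVP status: Currently only the semantic score threshold of 0.6 is applied. The keyword score filter and RRF merging will be implemented together when Hybrid Search (Task 8) is introduced.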
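
The filter → merge → return order can be illustrated with a small sketch. This is not the Task 8 implementation: ScoredHit, filterByThreshold, rrfMerge, and hybridMerge are hypothetical names, the smoothing constant k = 60 is an assumed (commonly used) default, and per-retriever weights are left out.

```typescript
// Hypothetical shape; the post does not show the planned Task 8 types.
interface ScoredHit {
  id: string; // document id
  score: number; // raw score from a single retriever
}

const RRF_K = 60; // assumed smoothing constant, not taken from the design doc

/** Drop hits below the threshold so they never reach the fusion step. */
function filterByThreshold(hits: ScoredHit[], threshold: number): ScoredHit[] {
  return hits.filter(hit => hit.score >= threshold);
}

/** Reciprocal Rank Fusion: every list contributes 1 / (k + rank), rank starting at 1. */
function rrfMerge(lists: ScoredHit[][], topK: number): { id: string; rrf: number }[] {
  const fused = new Map<string, number>();
  for (const list of lists) {
    const ranked = list.slice().sort((a, b) => b.score - a.score);
    ranked.forEach((hit, index) => {
      fused.set(hit.id, (fused.get(hit.id) ?? 0) + 1 / (RRF_K + index + 1));
    });
  }
  const merged: { id: string; rrf: number }[] = [];
  fused.forEach((rrf, id) => merged.push({ id, rrf }));
  return merged.sort((a, b) => b.rrf - a.rrf).slice(0, topK);
}

/** Filter first (semantic >= 0.6, keyword >= 0.5), then merge, then return top K. */
function hybridMerge(semantic: ScoredHit[], keyword: ScoredHit[], topK: number) {
  return rrfMerge(
    [filterByThreshold(semantic, 0.6), filterByThreshold(keyword, 0.5)],
    topK
  );
}
```

Because the filter runs before fusion, a document that only appears with a low score never receives a rank-based contribution at all.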

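6) Improving source accuracy in stages

Because of how streaming works, it is hard to determine “which sources were actually cited” before the answer finishes. So source-accuracy improvement was split into two phases. Phase 1 returns all retrieved sources so it can ship quickly, and Phase 2 parses (출처 N) markers (“출처” means “source”) from the response text and keeps only the sources that were actually cited. Improving in stages was judged more realistic than delaying the whole schedule to get perfect accuracy in the first release.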
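
As a rough illustration of Phase 2, the sketch below parses (출처 N) markers from a finished answer and keeps only the cited sources. It assumes N is a 1-based index into the sources array sent to the client; SourceRef, parseCitedSourceNumbers, and keepCitedSources are hypothetical names, not the project's code.

```typescript
// Hypothetical source shape for illustration.
interface SourceRef {
  title: string;
  url: string;
}

/** Collect the source numbers cited as "(출처 N)" in the answer text. */
function parseCitedSourceNumbers(answer: string): Set<number> {
  const cited = new Set<number>();
  const marker = /\(출처\s*(\d+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(answer)) !== null) {
    cited.add(Number(match[1]));
  }
  return cited;
}

/** Keep only sources whose 1-based position was cited (assumed numbering). */
function keepCitedSources(answer: string, sources: SourceRef[]): SourceRef[] {
  const cited = parseCitedSourceNumbers(answer);
  return sources.filter((_, index) => cited.has(index + 1));
}
```

Markers that point outside the sources array are simply ignored, so a hallucinated number cannot add a source.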

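7) Timeout/cache design

For runtime stability, time and cache policies were also defined explicitly. The vector query timeout is 1000ms and the total search budget is 2000ms; cache keys use the {query}:{indexVersion} format with a TTL of 5 minutes. The design is for the MiniSearch fallback to kick in automatically when this budget is exceeded. Making the time limits explicit lets the system switch to the next-best path before a “slow search” hurts the user experience.

MVP status: Timeouts and caching are not implemented yet. Currently the MiniSearch fallback runs when the RAG search fails (throws), and timeout-budget-based automatic switching and the query cache are planned for Tasks 9 and 14.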
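
Since the cache is still unimplemented, the following is only a sketch of how the {query}:{indexVersion} key and the 5-minute TTL could fit together. QueryCache and buildCacheKey are hypothetical names, and the injectable clock exists only to make expiry easy to test.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

const QUERY_CACHE_TTL_MS = 5 * 60 * 1000; // TTL of 5 minutes, as in the design

/** Key format from the design: a new index version never reuses old entries. */
function buildCacheKey(query: string, indexVersion: string): string {
  return `${query}:${indexVersion}`;
}

class QueryCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + QUERY_CACHE_TTL_MS });
  }
}
```

Putting indexVersion in the key means a re-index invalidates results implicitly, without an explicit purge step.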

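Why Correctness Properties were defined

The part of this design document that proved especially useful was the “correctness properties.”

If a requirement is a declaration, a property is a verification criterion. For example, a property such as “number of loaded documents = number of documents in the collection” is the criterion for checking that the loader did not drop any documents, and the “graceful fallback on RAG failure” property becomes the pass condition for failure tests. Beyond these, the document also specified guaranteed completeness of document metadata, the top-K limit, API compatibility (prompt/query, source marker), and reflecting indexVersion in the cache key.

Only when these are stated explicitly can implementation, testing, and operations work from the same criteria.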
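
To make the idea of a property concrete, here is a hand-rolled sketch of checking the top-K limit over random inputs. selectTopK, createRandom, and checkTopKProperty are hypothetical names; a real test suite would more likely use a property-testing library than this small deterministic generator.

```typescript
interface Hit {
  id: string;
  score: number;
}

/** Function under test: highest scores first, at most topK results. */
function selectTopK(hits: Hit[], topK: number): Hit[] {
  return hits
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, topK));
}

/** Small deterministic generator (an LCG) so a failing run can be reproduced. */
function createRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

/** Checks the top-K property on `runs` random inputs; throws on a counterexample. */
function checkTopKProperty(runs: number, seed: number = 42): void {
  const random = createRandom(seed);
  for (let run = 0; run < runs; run += 1) {
    const size = Math.floor(random() * 20);
    const topK = Math.floor(random() * 10);
    const hits: Hit[] = [];
    for (let i = 0; i < size; i += 1) {
      hits.push({ id: `doc-${i}`, score: random() });
    }
    const result = selectTopK(hits, topK);
    // Property 1: never more than topK, and never fewer than available.
    if (result.length !== Math.min(size, topK)) {
      throw new Error(`run ${run}: got ${result.length} results for topK=${topK}`);
    }
    // Property 2: scores are in non-increasing order.
    let previous = Infinity;
    for (const hit of result) {
      if (hit.score > previous) throw new Error(`run ${run}: results not sorted`);
      previous = hit.score;
    }
  }
}
```

The value of writing it this way is that the assertion describes an invariant over all inputs rather than one hand-picked example.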

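Why the test strategy was documented as well

RAG is not a single feature but a pipeline, so testing was staged too.

A Phase 0~4 rollout plan came first, and each phase was assigned contract tests (response format/streaming markers), fallback tests (switching to MiniSearch on failure), and property-based tests (verifying core invariants). The point is to verify not that “it works well” but that “it behaves predictably even when it breaks.”

Key lesson: the essence of adopting RAG is establishing operating rules, not swapping models

RAG is often described as “attaching a vector DB,” but in practice the following mattered more.

What data is ingested, and in what unit
By what criteria search results are accepted or rejected
Which path preserves service quality when something fails
How the existing UI/streaming contract is kept

The requirements document defined these operating rules, and the design document turned them into an executable structure.

Implementation record: MVP Task 1~7

Below is a summary, based on tasks.md, of what was actually implemented, along with code.

Task 1. Phase 0 - Safe groundwork

1.1 Environment variable/config loader

The first step was a config loader that safely reads the feature flag and default parameters. src/lib/rag/config.ts loads RAG_ENABLED, topK, the similarity threshold, and the embedding batch size, and falls back to defaults when a value is missing or invalid. With this in place, a bad env value does not immediately turn into an outage; the system runs on defaults instead.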

// src/lib/rag/config.ts
export function getRAGConfig(): RAGConfig {
  return {
    enabled: import.meta.env.RAG_ENABLED === "true",
    embeddingModel: normalizeEmbeddingModel(
      import.meta.env.RAG_EMBEDDING_MODEL
    ),
    chunkSize: getNumber(import.meta.env.RAG_CHUNK_SIZE, 700),
    chunkOverlap: getNumber(import.meta.env.RAG_CHUNK_OVERLAP, 120),
    topK: getNumber(import.meta.env.RAG_TOP_K, 5),
    similarityThreshold: getNumber(
      import.meta.env.RAG_SIMILARITY_THRESHOLD,
      0.6
    ),
    embeddingBatchSize: getNumber(
      import.meta.env.RAG_EMBEDDING_BATCH_SIZE,
      100
    ),
  };
}
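
The loader above relies on a getNumber helper that the post does not show. A minimal sketch of what such a helper could look like follows; the exact rules (treating empty strings and non-finite values as invalid) are assumptions, not the project's actual behavior.

```typescript
/**
 * Parse a numeric setting, falling back to a default when the value is
 * missing or not a finite number. Sketch only; validation rules are assumed.
 */
function getNumber(raw: unknown, fallback: number): number {
  if (typeof raw === "number") {
    return Number.isFinite(raw) ? raw : fallback;
  }
  if (typeof raw !== "string" || raw.trim() === "") {
    return fallback; // undefined, empty, or an unexpected type
  }
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}
```

With a helper like this, a typo such as RAG_TOP_K=five degrades to the default of 5 instead of propagating NaN into the search path.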

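1.2 Vector store abstraction

The MVP started by putting a VectorStore interface in src/lib/rag/vector-store.ts and using an InMemoryVectorStore implementation. The Upstash-first strategy remains, but the MVP uses a structure that can be verified quickly and is shared by local and prod. query returns the topK results sorted by cosine similarity.

The design document put Upstash Vector first, but for MVP safety the current branch is pinned to an in-memory store + prebuilt index (public/rag-index.json) strategy for now.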

// src/lib/rag/vector-store.ts
export interface VectorStore {
  upsert(chunks: EmbeddedChunk[]): Promise<void>;
  query(
    queryEmbedding: number[],
    options: { topK: number }
  ): Promise<SemanticHit[]>;
  size(): number;
}

export class InMemoryVectorStore implements VectorStore {
  private readonly store = new Map<string, EmbeddedChunk>();

  async query(queryEmbedding: number[], options: { topK: number }) {
    return Array.from(this.store.values())
      .map(chunk => ({
        chunk,
        score: cosineSimilarity(queryEmbedding, chunk.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, options.topK);
  }
}
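
The query method above calls cosineSimilarity, which is not shown in the post. The sketch below is one straightforward way to write it; returning 0 for a zero-length vector and throwing on mismatched dimensions are assumptions made for this example.

```typescript
/**
 * Cosine similarity between two embeddings: dot(a, b) / (|a| * |b|).
 * Returns 0 when either vector has zero magnitude (an assumed convention).
 */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}
```

The result ranges from -1 to 1, which is why a fixed threshold such as 0.6 can be applied to it directly.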

1.3 Logging/metrics utilities

The RAG pipeline has many stages, so minimal structured logging was added first.

// src/lib/rag/logger.ts
export const ragLogger = {
  info: (message: string, context?: Record<string, unknown>) =>
    write("info", message, context),
  warn: (message: string, context?: Record<string, unknown>) =>
    write("warn", message, context),
  error: (message: string, context?: Record<string, unknown>) =>
    write("error", message, context),
};

Task 2. Phase 1A - Document processing pipeline

2.1 Document loader

Document collection loads the blog documents from the Astro Content Collection and the custom documents (src/content/rag/custom-documents.json), then merges them into the RAG input. The blog loader excludes drafts and extracts title, description, tags, url, and content.

// src/lib/rag/document-loader.ts
export async function loadRAGDocuments(): Promise<RAGDocument[]> {
  const [blogDocs, customDocs] = await Promise.all([
    loadBlogDocuments(),
    loadCustomDocuments(),
  ]);

  return [...blogDocs, ...customDocs];
}

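2.x The chunking decision

Initially I wrote a heading + size/overlap based chunking module (src/lib/rag/chunking.ts), expecting that splitting documents more finely would raise retrieval precision. But after building an eval script and measuring, the conclusion was that chunking actually lowered performance. The “Eval: Document-level vs Chunked RAG” section below covers this in detail.

As a result, the 1 document = 1 embedding (Document-level RAG) strategy is now final, and the chunking module has been removed.

Task 3. Phase 1A - Embedding generation

3.1 Embedding service

Batch embedding plus retries (2s/4s/8s) absorbs API instability.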

// src/lib/rag/embeddings.ts
for (let attempt = 0; attempt <= backoffMs.length; attempt += 1) {
  try {
    const result = await embedMany({
      // uses the Vercel AI SDK `embedMany()`
      model,
      values: batch.map(chunk => chunk.text), // one call per batch
    });
    // ... push embeddings
    break;
  } catch (error) {
    if (attempt < backoffMs.length)
      await sleep(backoffMs[attempt]); // retry with exponential backoff (2s, 4s, 8s)
    else throw error;
  }
}

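3.3/3.4 sync script

The sync-rag-index script embeds the blog and custom documents and generates a pre-embedded file (public/rag-index.json). It is wired into the sync-rag-index script in package.json and into the build pipeline so that the index can be built ahead of time at deploy.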

// scripts/sync-rag-index.ts
const documents = allDocs.map(doc => ({
  id: doc.id,
  docId: doc.id,
  text: [
    doc.publishedAt ? `Published: ${doc.publishedAt}` : "",
    doc.title,
    doc.description,
    doc.content,
  ]
    .filter(Boolean)
    .join("\n\n"),
  metadata: {
    title: doc.title,
    ...(doc.titleEn ? { titleEn: doc.titleEn } : {}),
    tags: doc.tags ?? [],
    url: doc.url,
    ...(doc.publishedAt ? { publishedAt: doc.publishedAt } : {}),
  },
}));

// generate embeddings batch by batch
for (let i = 0; i < documents.length; i += batchSize) {
  const batch = documents.slice(i, i + batchSize);
  const result = await embedMany({
    model,
    values: batch.map(d => d.text),
  });
  allEmbeddings.push(...result.embeddings);
}

await writeFile(outFile, JSON.stringify(embedded), "utf-8");

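The publishedAt metadata is included in both the embedding text and the metadata so that recency-aware search is possible.

Task 5. Phase 1B - Semantic search/context assembly

5.1 Semantic search

After the query is embedded, results are filtered by the similarity threshold.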

// src/lib/rag/semantic-search.ts
const hits = await vectorStore.query(embedding, { topK: options.topK }); // query the vector store with the query embedding
return hits.filter(hit => hit.score >= options.similarityThreshold); // apply the similarity threshold filter

5.3 Context formatter

Results with the same URL are merged to reduce duplicate context, and the (출처 N) citation rule is included in the prompt.

// src/lib/rag/context-formatter.ts
const key = hit.chunk.metadata.url;
const entry = merged.get(key) ?? { title, url, texts: [] };
entry.texts.push(hit.chunk.text); // build the source array passed to the UI
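
The excerpt above shows only the merge step. A fuller sketch of merging by URL and numbering the context might look like the following; SemanticHitLike, mergeHitsByUrl, and formatContext are hypothetical names, and the exact prompt layout is an assumption.

```typescript
// Hypothetical shapes modeled on the excerpt above.
interface SemanticHitLike {
  chunk: { text: string; metadata: { title: string; url: string } };
  score: number;
}

interface MergedSource {
  title: string;
  url: string;
  texts: string[];
}

/** Group hits by URL, keeping first-seen order so numbering stays stable. */
function mergeHitsByUrl(hits: SemanticHitLike[]): MergedSource[] {
  const merged = new Map<string, MergedSource>();
  for (const hit of hits) {
    const { title, url } = hit.chunk.metadata;
    const entry = merged.get(url) ?? { title, url, texts: [] };
    entry.texts.push(hit.chunk.text);
    merged.set(url, entry);
  }
  return Array.from(merged.values());
}

/** Render each merged source under a "(출처 N)" label for the prompt. */
function formatContext(sources: MergedSource[]): string {
  return sources
    .map(
      (source, index) =>
        `(출처 ${index + 1}) ${source.title}\n${source.url}\n${source.texts.join("\n\n")}`
    )
    .join("\n\n---\n\n");
}
```

Numbering the merged sources in the same order as the array sent to the UI is what lets a later step map a cited marker back to a source.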

Task 6. Phase 1C - `/api/search` integration

The core was “RAG ON/OFF + immediate fallback on failure + keeping the existing contract.” With the implementation below, the streaming/source format that the frontend (LLMSearchModal) expects stays the same even when the backend internals change.

// src/pages/api/search.ts
try {
  if (isRAGEnabled()) {
    // switch paths with the `RAG_ENABLED` feature flag
    const rag = await runRAGSearch(prompt, {
      apiKey,
      originRequestUrl: request.url,
    });
    sourcesForClient = rag.sources;
    llmPrompt = rag.prompt;
  } else {
    // MiniSearch path
  }
} catch (error) {
  console.warn("RAG search failed; falling back to MiniSearch", error);
  // MiniSearch fallback
}

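6.5 A closer look at `src/lib/rag/index.ts` (core runtime orchestration)

I think of this file as the “control tower of the RAG engine.” It reads the config (getRAGConfig), checks whether the index is loaded, ingests if needed, and assembles semantic search → prompt/source conversion in one place.

A. Module-level state: `vectorStore`, `isIngested`

const vectorStore = new InMemoryVectorStore();
let isIngested = false;

vectorStore: keeps the vectors in the memory of the server runtime instance.
isIngested: a guard that prevents duplicate ingestion within the same instance.
- Ingestion runs only on the first request,
- and later requests go straight to the search step.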

B. Loading the prebuilt index first: `loadPrebuiltIndex()`

async function loadPrebuiltIndex(_originRequestUrl: string) {
  try {
    const filePath = join(process.cwd(), "rag-index.json");
    const raw = await readFile(filePath, "utf-8");
    const chunks = JSON.parse(raw) as EmbeddedChunk[];
    if (chunks.length === 0) return null;
    return chunks;
  } catch {
    return null;
  }
}

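This function implements the strategy of “trying the prebuilt index before runtime embedding.” Reading rag-index.json directly from the file system greatly reduces first-request latency and cost, and if the file is missing or the index is empty it returns null and the flow moves on to the next path.

C. Converting documents to embedding units: `toDocumentChunks()`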

return loadRAGDocuments().then(docs =>
  docs.map(doc => ({
    id: doc.id,
    docId: doc.id,
    text: `${doc.title}\n\n${doc.description}\n\n${doc.content}`,
    metadata: {
      title: doc.title,
      ...(doc.titleEn ? { titleEn: doc.titleEn } : {}),
      tags: doc.tags,
      url: doc.url,
    },
  }))
);

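This is the Document-level RAG strategy: title + description + content are combined into one text and embedded per document. titleEn is also included in the metadata for multilingual source display. I first thought of this as a temporary MVP strategy, but the eval showed document-level embedding beating chunking on both hit rate and MRR, so it was confirmed as the final architecture. (Ref: https://github.com/Hanna922/hanna.dev/pull/19)

D. Ingestion gate: `ingestIfNeeded()`

This function is what actually guarantees that the index is “prepared only once.”

Return immediately if isIngested is true
Try to load the prebuilt index
If the prebuilt index loads, upsert and finish
If it does not, embed at runtime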

if (isIngested) return;

const prebuilt = await loadPrebuiltIndex(originRequestUrl);
if (prebuilt) {
  await vectorStore.upsert(prebuilt);
  isIngested = true;
  ragLogger.info("RAG prebuilt index loaded", { chunks: prebuilt.length });
  return;
}

const embedded = await embedChunks(chunks, {
  apiKey,
  model: config.embeddingModel,
  batchSize: config.embeddingBatchSize,
});
await vectorStore.upsert(embedded);
isIngested = true;

Thanks to this structure, in production most requests take the “load prebuilt immediately → search” path.

E. Final execution function: `runRAGSearch()`

const config = getRAGConfig();
await ingestIfNeeded(options.apiKey, options.originRequestUrl);

const hits = await semanticSearch(query, vectorStore, {
  apiKey: options.apiKey,
  model: config.embeddingModel,
  topK: config.topK,
  similarityThreshold: config.similarityThreshold,
});
const localizedHits = filterHitsByLocale(hits, options.locale ?? "ko");

return {
  hits: localizedHits,
  prompt: buildPromptWithContext(query, localizedHits, options.locale ?? "ko"),
  sources: toSourceRefsFromSemanticHits(localizedHits),
};

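In short, this function fixes the following four steps.

Guarantee the index is ready
Semantic search
Locale-based filtering (prefer the user's language among Korean/English posts)
Convert to the LLM input/output format

That is, it serves as the single entry-point function that src/pages/api/search.ts calls when taking the RAG path.

Task 7. Checkpoint - Verifying end-to-end behavior

As of the current MVP (Task 7), the following flow is complete.

Generate the pre-embedded index with sync-rag-index
Load the prebuilt index first in the server runtime (src/lib/rag/index.ts)
When RAG_ENABLED=true, run semantic search + build the context prompt
Fall back to MiniSearch on failure
Keep the existing UI streaming display/source rendering

The index loading priority is also explicit in code.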

// src/lib/rag/index.ts
const prebuilt = await loadPrebuiltIndex(originRequestUrl);
if (prebuilt) {
  await vectorStore.upsert(prebuilt);
  isIngested = true;
  return;
}

// no prebuilt index: embed at runtime
const embedded = await embedChunks(chunks, { apiKey, model, batchSize });
await vectorStore.upsert(embedded);

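Getting this far was the goal of building “the first working RAG.”

Eval: Document-level vs Chunked RAG

After the MVP, to answer the question “is document-level embedding really the best option?”, I wrote an eval script (scripts/run-eval.ts) and ran three rounds of experiments on 36 eval items.

Experiment design

The 36 eval items were grouped into categories such as project-motivation, project-detail, concept, cross-post, and negative, and for each item Chunked RAG (284 chunks), Document-level RAG (38 documents), and MiniSearch were evaluated side by side. The key metrics are Hit Rate @5 and MRR (Mean Reciprocal Rank).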
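
For readers unfamiliar with the two metrics, here is a sketch of how they can be computed. It assumes each eval item has a single expected URL and that MRR is measured within the top k; EvalResult and the function names are hypothetical and do not come from scripts/run-eval.ts.

```typescript
// Hypothetical eval record: one expected URL and the ranked retrieval output.
interface EvalResult {
  expectedUrl: string;
  rankedUrls: string[];
}

/** 1 / rank of the expected URL within the top k, or 0 if it is absent. */
function reciprocalRank(result: EvalResult, k: number): number {
  const index = result.rankedUrls.slice(0, k).indexOf(result.expectedUrl);
  return index === -1 ? 0 : 1 / (index + 1);
}

/** Share of items whose expected URL appears anywhere in the top k. */
function hitRateAtK(results: EvalResult[], k: number): number {
  if (results.length === 0) return 0;
  const hits = results.filter(result => reciprocalRank(result, k) > 0).length;
  return hits / results.length;
}

/** Mean of the reciprocal ranks; rewards placing the right document higher. */
function meanReciprocalRank(results: EvalResult[], k: number): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, result) => sum + reciprocalRank(result, k), 0);
  return total / results.length;
}
```

Hit Rate only asks whether the right document was retrieved at all, while MRR also penalizes it appearing lower in the list, which is why the two numbers can differ widely for the same system.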

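Summary of the three rounds

Round	Change	Chunked Hit Rate	Document-level Hit Rate	Delta (pp)
1	Baseline comparison	87.9%	97.0%	-9.1
2	+ document-level deduplication	90.9%	97.0%	-6.1
3	Chunking removed (Document-level confirmed)	—	97.0%	0

There was no round in which chunking beat Document-level. Applying deduplication in round 2 gave only a limited improvement, and the chunked approach kept failing in the cross-post category in particular (eval-026~028). The cause was that chunking fragmented the context, so it could not handle questions spanning multiple posts.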

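Final performance comparison (Document-level RAG vs MiniSearch)

Metric	Document-level RAG	MiniSearch
Hit Rate @5	97.0%	75.8%
MRR	64.6%	42.5%
Keyword Coverage	93.2%	94.0%
Avg Latency	0.6ms	3.1ms

Adopting RAG improved Hit Rate by 21.2 percentage points and MRR by 22.1 percentage points over MiniSearch. Because only 38 documents are indexed, there is also less computation than with 284 chunks, and latency improved as well.

Why Document-level won

The corpus fits comfortably within the embedding context window: at the scale of 16 blog posts (8 each in Korean and English, plus custom documents), embedding a whole document loses no meaning. Chunking is a technique needed for documents tens of thousands of words long, or when completely unrelated topics are mixed in one document.
The context-fragmentation problem of chunking: when posts are split, for broad questions such as “Does this blog have a post about performance optimization?”, partial chunks take up the top-K and the actually relevant documents are missed.
Computational efficiency: 38 documents vs 284 chunks. Both indexing and search got better results with fewer vectors.

Based on these results, the chunking module was removed and Document-level RAG was confirmed as the final architecture.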

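What to develop next (next steps)

After Task 7, I plan to prioritize the following.

Formal introduction of Hybrid Search (Task 8)
- Run keyword + semantic search in parallel
- Merge with RRF after the threshold filter
- Tune weights (0.4 / 0.6)
Query cache/version key (Task 9)
- {query}:{indexVersion} key strategy
- Refine the TTL/invalidation policy
Incremental indexing + manifest (Task 11)
- The current sync-rag-index script re-indexes all documents on every run. With about 38 documents the cost is small, so this is how it runs for now.
- Goal: define document ids in the {postId} format, store each document's contentHash and lastUpdated in a manifest, and introduce an incremental update strategy that deletes/upserts only documents whose hash changed
- Once this is in place, re-indexing becomes idempotent: running the same script repeatedly updates only the changes, with no side effects.
Reprocessing failed documents (DLQ) and operations automation (Task 12~13)
- Build a retry pipeline
- Shorten failure recovery time
Production hardening (Task 14~16)
- Performance metrics (p50/p95), fixed timeout budget
- Stronger property/integration tests
- Operations guide/runbook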
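
The incremental indexing goal listed under Task 11 can be sketched as a manifest diff. Everything here is hypothetical (Manifest, planIncrementalSync, buildManifest), and the FNV-1a hash is only a dependency-free stand-in for whatever content hash the real script would use, most likely a cryptographic one.

```typescript
interface ManifestEntry {
  contentHash: string;
  lastUpdated: string; // ISO timestamp
}
type Manifest = Record<string, ManifestEntry>;

interface IndexableDoc {
  id: string; // {postId}
  content: string;
}

/** 32-bit FNV-1a over UTF-16 code units; a stand-in for a real content hash. */
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

/** Decide which documents need re-embedding and which vectors to delete. */
function planIncrementalSync(docs: IndexableDoc[], manifest: Manifest) {
  const upsert: string[] = [];
  const unchanged: string[] = [];
  for (const doc of docs) {
    const previous: ManifestEntry | undefined = manifest[doc.id];
    if (!previous || previous.contentHash !== fnv1a(doc.content)) upsert.push(doc.id);
    else unchanged.push(doc.id);
  }
  const currentIds = new Set(docs.map(doc => doc.id));
  const remove = Object.keys(manifest).filter(id => !currentIds.has(id));
  return { upsert, remove, unchanged };
}

/** Manifest to persist after a successful sync. */
function buildManifest(docs: IndexableDoc[], nowIso: string): Manifest {
  const manifest: Manifest = {};
  for (const doc of docs) {
    manifest[doc.id] = { contentHash: fnv1a(doc.content), lastUpdated: nowIso };
  }
  return manifest;
}
```

Running the plan a second time against the manifest produced by the first run yields empty upsert and remove lists, which is the idempotency the goal describes.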

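Closing

MiniSearch-based prompt injection was a very good choice for getting started quickly. But from the moment accuracy, groundedness, reproducibility, and cost control were all required, its structural limits became clear.

These RAG requirements and design documents are the answer to “how can a trustworthy search-and-generation system be operated over the long term?”

If you are considering the same transition, I recommend writing down “why this requirement is needed” from the perspective of failure cases before choosing technology. As a next step, connecting architecture, interfaces, and test criteria in the design document should improve both future incident response speed and product quality.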

From MiniSearch to RAG - Blog Search Enhancement

February 14, 2026, 10:00 AM

The “requirements document” often looks like a simple feature specification, but in practice it is closer to a record of failures I encountered in production.

In this post, I will share why the RAG requirements became necessary for the current Astro blog, what limitations we faced with the original MiniSearch + LLM prompt injection approach, and the full implementation details of MVP (Task 1~7) that turned those requirements and design docs into code.

Starting point: Why MiniSearch + Gemini was initially attractive

The initial setup was intentionally simple:

Index static blog posts using MiniSearch.
When a user asks a question, find a few documents using keyword search.
Append matched snippets to the LLM prompt and generate a response.

This approach is fast to implement and has low infrastructure overhead, so for small blogs it often looks “good enough.”
But once questions got longer, vocabulary varied, or users asked for code context, we started hitting real limits.

1) The limit of exact-match keyword retrieval

MiniSearch is fundamentally keyword matching. If a user asks for “fiber comparison phase” instead of “reconcile phase,” many relevant docs are missed. It is weak with multilingual or mixed-language queries, abbreviations, and semantic equivalents because the retrieval depends heavily on surface text, not semantic meaning.

2) Unstable quality of prompt injection context

If you directly append MiniSearch results to the prompt, unrelated paragraphs can be included, making LLM judgment noisy.
On top of that, code-block and link context is often fragmented, reducing citation accuracy. In other words, we did have “search → injection,” but there was no guarantee of the density and consistency of the injected context.

3) Output source and UI contract mismatches

The existing UI (LLMSearchModal) expects streaming results and a sources format. At first, search results and answer output were loosely coupled, so some docs appeared in the source list without being used, while used docs were not included in sources.
From the user perspective, it became hard to trust whether the answer truly came from the intended blog content.

4) Poor observability and reproducibility in operation

When incidents happened, root-cause analysis was slow:
which docs were attached for a query, why a document was dropped by the similarity threshold, and why fallback kicked in under failure were all difficult to trace.

Key implementation decisions finalized in the design doc

If the requirement doc answered “why this is needed,” the design doc defined “how to implement it.” The decisions below were intentionally explicit.

Note: The following describes the target architecture from the design phase. The MVP (Task 1~7) was implemented with InMemoryVectorStore + prebuilt index strategy. Upstash Vector, Hybrid Search (RRF), caching, and incremental indexing are planned for future phases. Each item includes a note on current MVP status.

1) Architecture: Hybrid Search, not removing MiniSearch

The principle was not replacement, but combination.

MiniSearch is still strong at fast keyword matching. If the user types an exact term, it returns fast and accurately. So we kept it. For meaning-based retrieval gaps, the goal was to add vector search and merge both result sets with RRF (Reciprocal Rank Fusion) after threshold filtering. This keeps existing strengths while improving both recall and precision.

MVP status: Currently, semantic search (InMemoryVectorStore) runs as the sole retrieval path, with MiniSearch used only as a fallback when RAG fails. Parallel keyword + semantic execution with RRF merging is planned for Task 8.

2) Why we chose each technology explicitly

The design doc recorded not only what was picked, but why alternatives were excluded.

Vercel AI SDK was already used in the existing streaming pipeline, so a new pipeline could plug into the same interfaces with little friction. The target vector store was Upstash Vector, chosen for its serverless compatibility and free tier suited to personal-blog scale. For embeddings, we fixed on gemini-embedding-001 (768-dim) to balance cost and latency.

LangChain was excluded due to integration complexity and bundle/runtime weight for this scale.

MVP status: Instead of Upstash, the vector store is implemented as InMemoryVectorStore + prebuilt index (rag-index.json). The in-memory approach works well at blog scale, with Upstash migration planned as the corpus grows.

3) Core decision: decouple the indexing pipeline

To keep deployment stable, we separated astro build from index synchronization. An independent sync-rag-index script can be run manually, scheduled, or through CI. This prevents an external embedding/API issue from breaking regular site deployment.

Script-level separation was not enough

Separating build and sync-rag-index as distinct scripts in package.json was the first step.

"build": "astro check && astro build && jampack ./dist"
"sync-rag-index": "npx tsx scripts/sync-rag-index.ts"

However, in CI (deploy.yml), both ran sequentially within the same job (sync-rag-index → vercel build → vercel deploy), meaning an embedding API outage would block the entire build and deployment. Even with scripts separated, if the execution flow is coupled, failure isolation is not achieved.

GitHub Workflows-level separation

To solve this, we split the single workflow into an indexing workflow (rag-index.yml) and a deploy workflow (deploy.yml).

	rag-index.yml (indexing)	deploy.yml (build/deploy)
Trigger 1	push (content paths only)	push (excluding content paths)
Trigger 2	workflow_dispatch (manual)	workflow_run (after indexing completes)

The indexing workflow triggers only when blog content (src/content/blog/**, src/content/rag/**) changes. The generated rag-index.json is uploaded as a GitHub Actions artifact, and the deploy workflow uses dawidd6/action-download-artifact to download the artifact from the most recent successful indexing run.

Preventing duplicate deploys: paths-ignore + workflow_run

# deploy.yml
on:
  push:
    branches: [master]
    paths-ignore:
      - "src/content/blog/**"
      - "src/content/rag/**"
      - "scripts/sync-rag-index.ts"
  workflow_run:
    workflows: ["RAG Index"]
    types: [completed]
    branches: [master]

Content-only pushes no longer trigger deploy.yml directly — deployment happens only through workflow_run after rag-index.yml completes. When both code and content change simultaneously, both workflows trigger, but concurrency: { group: deploy, cancel-in-progress: true } ensures only the final deploy completes.

This structure results in the following behavior per scenario:

Scenario	rag-index.yml	deploy.yml
Code-only change	Not triggered	push trigger → reuses previous artifact → deploy
Content-only change	Generate embeddings → upload artifact	push trigger skipped → workflow_run deploy (fresh)
Code + content change	Generate embeddings → upload artifact	push trigger deploy → workflow_run redeploy (concurrency ensures only one completes)
Embedding API failure	Fails	Deploys normally with previous artifact (not blocked)
First deploy (bootstrap)	—	No artifact found → inline fallback generation

When code and content are included in the same push, deploy.yml starts via push trigger first, then re-triggers via workflow_run, cancelling the first run. This is not a functional issue but log noise — the final deploy always completes successfully with a fresh index.

Why we kept the push trigger on deploy.yml

A simpler alternative would be to trigger rag-index.yml on all pushes and have deploy.yml trigger only via workflow_run. This would eliminate cancellation noise on mixed pushes. However, this design has a critical flaw: all deployments become dependent on rag-index.yml succeeding. If the embedding API goes down, even code-only changes cannot be deployed. Since the core purpose of workflow separation was to isolate embedding failures from deployment, the push trigger on deploy.yml must be retained.

The key point is that even if the embedding API is down, deployment is never blocked, and content-only changes never deploy with a stale index. If script separation achieved “build command independence,” workflow separation achieves “execution flow independence.”

4) Splitting code-block treatment by search purpose

In technical blogs, code can be essential evidence. Since “finding content quickly by keyword” and “providing accurate evidence to an LLM” require different data representations, code blocks are handled differently depending on the search purpose when indexing the same blog post.

In search-index.json.ts, which generates the MiniSearch index, the stripMarkdown() function removes code blocks before indexing. In keyword matching, code is mostly noise.

// src/pages/search-index.json.ts
function stripMarkdown(md: string) {
  return md
    .replace(/```[\s\S]*?```/g, " ") // Remove multi-line code blocks
    .replace(/`[^`]*`/g, " "); // Remove inline code
  // ...
}
const content = stripMarkdown(post.body ?? "");

In contrast, the RAG embedding pipeline uses post.body as-is in document-loader.ts. The LLM needs code as evidence for its answers, so the original must be preserved.

// src/lib/rag/document-loader.ts
return {
  // ...
  content: post.body ?? "", // Raw original — code blocks preserved
};

In summary, even for the same blog post, each pipeline indexes a different representation of the data.

	MiniSearch	RAG
Processing	`stripMarkdown(post.body)`	`post.body` (raw)
Code blocks	Removed	Preserved
Reason	Code is noise for keyword search	Code is needed as LLM evidence
Output	`search-index.json`	`rag-index.json`

5) Score normalization and quality filters

Because RRF is rank-based, low-quality results can pollute the ranking. To prevent this, filtering happens before fusion: semantic results must score at least 0.6 and keyword results at least 0.5, only the results that pass are merged via RRF, and then the top-K are returned. If the order were reversed, low-similarity matches could move up purely because of rank effects.

MVP status: Currently only the semantic score threshold (0.6) is applied. Keyword score filtering and RRF merging will be implemented alongside Hybrid Search (Task 8).

6) Gradual source-accuracy rollout

In streaming mode, it is hard to know which source was truly cited before answer completion.
So source accuracy was handled in two phases:

Phase 1: return all retrieved sources immediately for fast rollout.
Phase 2: parse (Source N) markers from the final text and retain only truly cited sources.

Trying to get perfect source accuracy in the first release risked schedule delay; phased delivery was the realistic choice.

7) Timeout and cache behavior

Stability is also about runtime boundaries.
We set:

Vector query timeout: 1000ms
Total search budget: 2000ms
cache key: {query}:{indexVersion} with TTL 5min

If these limits are exceeded, fallback to MiniSearch automatically. This avoids poor UX from slow searches by switching paths predictably.

MVP status: Timeout and cache are not yet implemented. Currently, MiniSearch fallback triggers on RAG exceptions. Timeout-budget-based automatic switching and query caching are planned for Task 9 and 14.

Why we defined correctness properties

One of the most practical outcomes was defining correctness properties explicitly.

If the requirement is intent, a property becomes a measurable check.
For example, “loaded document count equals collection document count” verifies that the loader dropped no documents, and
“graceful fallback on RAG failure” becomes the pass condition for failure tests.
We also defined completeness of document metadata, top-K enforcement, prompt/query API compatibility, and cache keys including indexVersion.

This alignment makes implementation, QA, and operation move with the same success criteria.

Why we also documented a test strategy

RAG is not a single feature, it is a pipeline. So tests were staged:

Contract tests for response format and streaming markers
Fallback tests for MiniSearch failover
Property-based tests for invariants

The point is not only “works in happy path” but “degrades predictably.”
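
To make the invariant idea concrete, here is a hand-rolled property check for top-K enforcement and threshold filtering. It deliberately avoids any test library, and `selectTopK`, `makeRandom`, and `checkTopKProperty` are hypothetical names; the actual test suite may be organized quite differently.

```typescript
// Sketch only: a tiny property check without a test framework.
interface CandidateHit {
  id: string;
  score: number;
}

function selectTopK(hits: CandidateHit[], topK: number, threshold: number): CandidateHit[] {
  return hits
    .filter(hit => hit.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Small deterministic PRNG (linear congruential) so failures are reproducible.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Property: for any input, the result has at most topK items, every score
// meets the threshold, and scores are in non-increasing order.
function checkTopKProperty(runs: number, seed = 42): boolean {
  const random = makeRandom(seed);
  for (let run = 0; run < runs; run += 1) {
    const hits: CandidateHit[] = [];
    const size = Math.floor(random() * 20);
    for (let i = 0; i < size; i += 1) {
      hits.push({ id: `doc-${i}`, score: random() });
    }
    const topK = 1 + Math.floor(random() * 5);
    const result = selectTopK(hits, topK, 0.6);
    const withinLimit = result.length <= topK;
    const aboveThreshold = result.every(hit => hit.score >= 0.6);
    const sorted = result.every((hit, i) => {
      const previous = result[i - 1];
      return previous === undefined || previous.score >= hit.score;
    });
    if (!(withinLimit && aboveThreshold && sorted)) return false;
  }
  return true;
}
```

The value of this style is that it asserts what must always hold, rather than what one hand-picked query returns.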

Main lesson: RAG is not just model substitution

Many people describe RAG as “adding a vector DB.”
The work that mattered more was defining operating rules:

What document units to ingest
Which matches are accepted/rejected
How quality is preserved under failures
How to preserve existing UI and streaming contracts

Requirements defined these operational rules, and the design doc made them executable.

Implementation log: MVP Task 1~7

Below is a practical walkthrough of what was implemented, tied to tasks.md.

Task 1. Phase 0 - Safe foundation

1.1 Env + config loader

We started by hardening settings around feature flags and query defaults in src/lib/rag/config.ts.
RAG_ENABLED, topK, similarity threshold, and embedding batch size are read with safe fallbacks.
Bad environment values no longer break immediately; they fall back to defaults.

// src/lib/rag/config.ts
export function getRAGConfig(): RAGConfig {
  return {
    enabled: import.meta.env.RAG_ENABLED === "true",
    embeddingModel: normalizeEmbeddingModel(
      import.meta.env.RAG_EMBEDDING_MODEL
    ),
    chunkSize: getNumber(import.meta.env.RAG_CHUNK_SIZE, 700),
    chunkOverlap: getNumber(import.meta.env.RAG_CHUNK_OVERLAP, 120),
    topK: getNumber(import.meta.env.RAG_TOP_K, 5),
    similarityThreshold: getNumber(
      import.meta.env.RAG_SIMILARITY_THRESHOLD,
      0.6
    ),
    embeddingBatchSize: getNumber(
      import.meta.env.RAG_EMBEDDING_BATCH_SIZE,
      100
    ),
  };
}
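
The `getNumber` helper is not shown in this post. A minimal version consistent with the "safe fallback" behavior described above might look like this; the exact parsing rules are my assumption.

```typescript
// Sketch only: an assumed implementation of the getNumber helper.
function getNumber(raw: string | undefined, fallback: number): number {
  // Missing or blank values use the default.
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  // Non-numeric input ("abc"), NaN, and Infinity also use the default.
  return Number.isFinite(parsed) ? parsed : fallback;
}
```

With this shape, a typo such as RAG_TOP_K=five degrades to the default of 5 instead of propagating NaN into the query path.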

1.2 Vector store abstraction

MVP started with a VectorStore interface in src/lib/rag/vector-store.ts and an InMemoryVectorStore implementation.
Upstash Vector remains the target store from the design, but for MVP stability we settled on an in-memory store backed by a prebuilt index (public/rag-index.json). The same store is used in local and production environments so that behavior can be verified first.

// src/lib/rag/vector-store.ts
export interface VectorStore {
  upsert(chunks: EmbeddedChunk[]): Promise<void>;
  query(
    queryEmbedding: number[],
    options: { topK: number }
  ): Promise<SemanticHit[]>;
  size(): number;
}

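export class InMemoryVectorStore implements VectorStore {
  private readonly store = new Map<string, EmbeddedChunk>();

  // upsert() and size() are not shown in the original excerpt; these are
  // minimal assumed versions that satisfy the VectorStore interface.
  async upsert(chunks: EmbeddedChunk[]): Promise<void> {
    for (const chunk of chunks) this.store.set(chunk.id, chunk);
  }

  async query(queryEmbedding: number[], options: { topK: number }) {
    // Brute-force scan: score every stored chunk, then keep the best topK.
    return Array.from(this.store.values())
      .map(chunk => ({
        chunk,
        score: cosineSimilarity(queryEmbedding, chunk.embedding),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, options.topK);
  }

  size(): number {
    return this.store.size;
  }
}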
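
The `cosineSimilarity` function used by `query` is not shown in the post. The standard definition is the dot product divided by the product of the vector norms; a minimal sketch follows. The zero-vector handling and the dimension check are my choices, not necessarily the repository's.

```typescript
// Sketch only: standard cosine similarity; edge-case handling is assumed.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat it as "not similar".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A linear scan like this costs one pass over every stored vector per query, which is acceptable for a corpus of a few dozen documents.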

1.3 Logging utility

Because the pipeline is long, we added minimal structured logging from day one.

// src/lib/rag/logger.ts
export const ragLogger = {
  info: (message: string, context?: Record<string, unknown>) =>
    write("info", message, context),
  warn: (message: string, context?: Record<string, unknown>) =>
    write("warn", message, context),
  error: (message: string, context?: Record<string, unknown>) =>
    write("error", message, context),
};
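
The `write` function behind `ragLogger` is not shown here. One plausible shape, assuming one JSON object per line, is sketched below. The field names (`ts`, `scope`, `context`) and the `formatLogLine` helper are hypothetical.

```typescript
// Sketch only: an assumed JSON-lines writer for the logger above.
type LogLevel = "info" | "warn" | "error";

function formatLogLine(
  level: LogLevel,
  message: string,
  context?: Record<string, unknown>,
  now: Date = new Date()
): string {
  // Nest context under its own key so it cannot overwrite ts/level/message.
  return JSON.stringify({
    ts: now.toISOString(),
    scope: "rag",
    level,
    message,
    context: context ?? {},
  });
}

function write(
  level: LogLevel,
  message: string,
  context?: Record<string, unknown>
): void {
  const line = formatLogLine(level, message, context);
  if (level === "error") console.error(line);
  else if (level === "warn") console.warn(line);
  else console.log(line);
}
```

Keeping each entry on a single machine-parseable line makes it easier to answer the earlier operational questions, such as which documents were attached to which query.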

Task 2. Phase 1A - Document pipeline

2.1 Document loader

RAG documents are composed from Astro blog posts and custom docs (src/content/rag/custom-documents.json).
Blog input excludes drafts and extracts title, description, tags, url, and content.

// src/lib/rag/document-loader.ts
export async function loadRAGDocuments(): Promise<RAGDocument[]> {
  const [blogDocs, customDocs] = await Promise.all([
    loadBlogDocuments(),
    loadCustomDocuments(),
  ]);

  return [...blogDocs, ...customDocs];
}

2.2 Decision on chunking

Initially, a heading + size/overlap chunking module (src/lib/rag/chunking.ts) was implemented with the expectation that finer document segmentation would improve retrieval precision. However, after building an eval script and measuring actual performance, chunking turned out to degrade results. This is covered in detail in the “Eval: Document-level vs Chunked RAG” section below.

The final decision was to adopt 1 document = 1 embedding (Document-level RAG) as the confirmed architecture, and the chunking module was removed.

Task 3. Phase 1A - Embedding generation

3.1 Embedding service

Embedding is performed in batches with retries and exponential backoff (2s/4s/8s) to absorb API instability.

// src/lib/rag/embeddings.ts
for (let attempt = 0; attempt <= backoffMs.length; attempt += 1) {
  try {
    const result = await embedMany({
      // Vercel AI SDK `embedMany()` usage
      model,
      values: batch.map(chunk => chunk.text),
    });
    // ... push embeddings
    break;
  } catch (error) {
    if (attempt < backoffMs.length)
      await sleep(backoffMs[attempt]); // exponential backoff: 2s, 4s, 8s
    else throw error;
  }
}
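
The excerpt above relies on `sleep` and `backoffMs`, which are not shown. A self-contained version of the same retry pattern, with the wait function injectable so it can be exercised without real delays, might look like this. `retryWithBackoff` is a hypothetical name.

```typescript
// Sketch only: the same retry-with-backoff pattern as a reusable helper.
const sleep = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  backoffMs: number[] = [2000, 4000, 8000],
  wait: (ms: number) => Promise<void> = sleep
): Promise<T> {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      const delay = backoffMs[attempt];
      // No delay left means every retry has been used: surface the error.
      if (delay === undefined) throw error;
      await wait(delay);
    }
  }
}
```

With three delays configured, the operation runs at most four times: the initial attempt plus one retry per delay.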

3.3/3.4 Sync script

The sync-rag-index script generates prebuilt embeddings for blog and custom documents and writes them to public/rag-index.json. It is wired into package.json and can run as part of a build or CI step without affecting the runtime path.

// scripts/sync-rag-index.ts
const documents = allDocs.map(doc => ({
  id: doc.id,
  docId: doc.id,
  text: [
    doc.publishedAt ? `Published: ${doc.publishedAt}` : "",
    doc.title,
    doc.description,
    doc.content,
  ]
    .filter(Boolean)
    .join("\n\n"),
  metadata: {
    title: doc.title,
    ...(doc.titleEn ? { titleEn: doc.titleEn } : {}),
    tags: doc.tags ?? [],
    url: doc.url,
    ...(doc.publishedAt ? { publishedAt: doc.publishedAt } : {}),
  },
}));

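// Batch embedding generation
const allEmbeddings: number[][] = [];
for (let i = 0; i < documents.length; i += batchSize) {
  const batch = documents.slice(i, i + batchSize);
  const result = await embedMany({
    model,
    values: batch.map(d => d.text),
  });
  allEmbeddings.push(...result.embeddings);
}

// Pair each document with its embedding. This step is elided in the
// original excerpt; the shape shown here is assumed.
const embedded = documents.map((doc, index) => ({
  ...doc,
  embedding: allEmbeddings[index],
}));

await writeFile(outFile, JSON.stringify(embedded), "utf-8");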

publishedAt metadata is included in both the embedding text and metadata, enabling recency-aware search responses.

Task 5. Phase 1B - Semantic search and context assembly

5.1 Semantic search

After embedding a query, we perform vector search and then filter by similarity threshold.

// src/lib/rag/semantic-search.ts
const hits = await vectorStore.query(embedding, { topK: options.topK });
return hits.filter(hit => hit.score >= options.similarityThreshold);

5.3 Context formatter

Results sharing the same URL are merged, and source markers are added as (Source N).

// src/lib/rag/context-formatter.ts
const key = hit.chunk.metadata.url;
const entry = merged.get(key) ?? { title, url, texts: [] };
entry.texts.push(hit.chunk.text);
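
Put together, the merge-and-mark step could be sketched as below. The output layout (separator, title line) is an assumption for illustration; only the merge-by-URL behavior and the (Source N) marker come from the description above. `ContextHit` and `formatContext` are hypothetical names.

```typescript
// Sketch only: merges hits by URL and numbers each merged source.
interface ContextHit {
  text: string;
  title: string;
  url: string;
}

function formatContext(hits: ContextHit[]): string {
  const merged = new Map<string, { title: string; url: string; texts: string[] }>();
  for (const hit of hits) {
    const entry = merged.get(hit.url) ?? { title: hit.title, url: hit.url, texts: [] };
    entry.texts.push(hit.text);
    merged.set(hit.url, entry);
  }
  // Map preserves insertion order, so Source 1 is the first URL seen.
  return Array.from(merged.values())
    .map(
      (entry, index) =>
        `(Source ${index + 1}) ${entry.title} - ${entry.url}\n${entry.texts.join("\n\n")}`
    )
    .join("\n\n---\n\n");
}
```

Numbering sources after merging keeps marker N aligned with the N-th entry in the sources list sent to the client.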

Task 6. Phase 1C - `/api/search` integration

The main goal was to support RAG on/off switching and fast fallback while keeping the UI contract unchanged:

The internal implementation may switch.
The response/marker format expected by LLMSearchModal remains stable.

// src/pages/api/search.ts
try {
  if (isRAGEnabled()) {
    const rag = await runRAGSearch(prompt, {
      apiKey,
      originRequestUrl: request.url,
    });
    sourcesForClient = rag.sources;
    llmPrompt = rag.prompt;
  } else {
    // MiniSearch path
  }
} catch (error) {
  console.warn("RAG search failed; falling back to MiniSearch", error);
  // MiniSearch fallback
}

6.5 Deep dive: `src/lib/rag/index.ts`

This file acts as the RAG control tower: it reads settings, prepares the in-memory index, runs semantic search, and maps results into prompt + sources.

A) Module state: `vectorStore`, `isIngested`

const vectorStore = new InMemoryVectorStore();
let isIngested = false;

vectorStore stores vectors in runtime memory.
isIngested guards duplicate ingestion in the same instance.

B) Prefer prebuilt index: `loadPrebuiltIndex()`

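import { readFile } from "node:fs/promises";
import { join } from "node:path";

async function loadPrebuiltIndex(_originRequestUrl: string) {
  try {
    const filePath = join(process.cwd(), "rag-index.json");
    const raw = await readFile(filePath, "utf-8");
    const chunks = JSON.parse(raw) as EmbeddedChunk[];
    if (chunks.length === 0) return null;
    return chunks;
  } catch {
    // Missing or unreadable file: signal "no prebuilt index" to the caller.
    return null;
  }
}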

This reads rag-index.json directly from the filesystem, trying the prebuilt index before any runtime embedding. If the file is missing or empty, it returns null and proceeds to the next path.

C) Document-to-embedding-unit conversion: `toDocumentChunks()`

return loadRAGDocuments().then(docs =>
  docs.map(doc => ({
    id: doc.id,
    docId: doc.id,
    text: `${doc.title}\n\n${doc.description}\n\n${doc.content}`,
    metadata: {
      title: doc.title,
      ...(doc.titleEn ? { titleEn: doc.titleEn } : {}),
      tags: doc.tags,
      url: doc.url,
    },
  }))
);

Document-level RAG strategy: title + description + content are combined into a single text and embedded per document. titleEn is included in metadata for multilingual source display. This was initially considered a temporary MVP approach, but eval results confirmed that document-level embedding outperforms chunking in both hit rate and MRR, so it became the final architecture. (Ref: https://github.com/Hanna922/hanna.dev/pull/19)

D) Ingestion gate: `ingestIfNeeded()`

If isIngested is true, return immediately.
Try prebuilt index.
If prebuilt exists, upsert and return.
Else fallback to runtime embedding.

if (isIngested) return;

const prebuilt = await loadPrebuiltIndex(originRequestUrl);
if (prebuilt) {
  await vectorStore.upsert(prebuilt);
  isIngested = true;
  ragLogger.info("RAG prebuilt index loaded", { chunks: prebuilt.length });
  return;
}

// No prebuilt index: build embedding units from the documents
// (toDocumentChunks() from section C) and embed them at runtime.
const chunks = await toDocumentChunks();
const embedded = await embedChunks(chunks, {
  apiKey,
  model: config.embeddingModel,
  batchSize: config.embeddingBatchSize,
});
await vectorStore.upsert(embedded);
isIngested = true;

This means normal execution should usually follow: prebuilt load -> query.

E) Main function: `runRAGSearch()`

const config = getRAGConfig();
await ingestIfNeeded(options.apiKey, options.originRequestUrl);

const hits = await semanticSearch(query, vectorStore, {
  apiKey: options.apiKey,
  model: config.embeddingModel,
  topK: config.topK,
  similarityThreshold: config.similarityThreshold,
});
const localizedHits = filterHitsByLocale(hits, options.locale ?? "ko");

return {
  hits: localizedHits,
  prompt: buildPromptWithContext(query, localizedHits, options.locale ?? "ko"),
  sources: toSourceRefsFromSemanticHits(localizedHits),
};

The flow is fixed to:

Ensure index readiness
Semantic retrieval
Locale-based filtering (prefer the user’s language, fall back to other languages)
Prompt + source mapping
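
The locale step is not shown in the post, so the following is only a guess at its shape. It assumes each hit carries a `locale` field, which the metadata excerpts above do not include, and it implements "prefer the user's language, otherwise keep everything". `LocalizedHit` and `filterHitsByLocaleSketch` are hypothetical names.

```typescript
// Sketch only: the `locale` field on hits is an assumption.
type Locale = "ko" | "en";

interface LocalizedHit {
  url: string;
  score: number;
  locale: Locale;
}

function filterHitsByLocaleSketch(hits: LocalizedHit[], locale: Locale): LocalizedHit[] {
  const preferred = hits.filter(hit => hit.locale === locale);
  // If nothing matches the user's language, fall back to the unfiltered hits
  // so the answer still has context to work with.
  return preferred.length > 0 ? preferred : hits;
}
```

Filtering after retrieval means a locale with few posts can still be answered from the other language instead of returning nothing.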

Task 7. Checkpoint - End-to-End validation

As of MVP (Task 7), these flows are complete:

Generate prebuilt index with sync-rag-index
Runtime loads prebuilt index first (src/lib/rag/index.ts)
Enable RAG with RAG_ENABLED=true for semantic search + context prompt generation
Fallback to MiniSearch on failure
Keep streaming source rendering contract in UI

Load preference is explicit in code:

// src/lib/rag/index.ts
const prebuilt = await loadPrebuiltIndex(originRequestUrl);
if (prebuilt) {
  await vectorStore.upsert(prebuilt);
  isIngested = true;
  return;
}

const embedded = await embedChunks(chunks, { apiKey, model, batchSize });
await vectorStore.upsert(embedded);

This is the first operationally usable version of the RAG pipeline.

Eval: Document-level vs Chunked RAG

After the MVP was running, the natural question was: “Is document-level embedding really the best approach?” To answer this, I built an eval script (scripts/run-eval.ts) and ran 3 rounds of experiments with 36 evaluation items.

Experiment design

The 36 eval items were categorized into project-motivation, project-detail, concept, cross-post, negative, etc. Each item was evaluated simultaneously against Chunked RAG (284 chunks), Document-level RAG (38 documents), and MiniSearch. The key metrics were Hit Rate @5 and MRR (Mean Reciprocal Rank).
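
For reference, the two metrics can be computed as in the sketch below. This is not the code in scripts/run-eval.ts: the `EvalResult` shape is hypothetical, and truncating MRR at the same cutoff k as Hit Rate is my assumption.

```typescript
// Sketch only: standard Hit Rate @k and MRR over a list of eval results.
interface EvalResult {
  expectedIds: string[]; // documents that count as a correct hit
  rankedIds: string[]; // retrieved document ids, best first
}

// 1 / rank of the first relevant result within the top k, or 0 if none.
function reciprocalRank(result: EvalResult, k: number): number {
  const top = result.rankedIds.slice(0, k);
  for (let i = 0; i < top.length; i += 1) {
    const id = top[i];
    if (id !== undefined && result.expectedIds.indexOf(id) !== -1) {
      return 1 / (i + 1);
    }
  }
  return 0;
}

// Fraction of queries with at least one relevant result in the top k.
function hitRateAtK(results: EvalResult[], k: number): number {
  if (results.length === 0) return 0;
  const hits = results.filter(result => reciprocalRank(result, k) > 0).length;
  return hits / results.length;
}

// Average of the reciprocal ranks across all queries.
function meanReciprocalRank(results: EvalResult[], k: number): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, result) => sum + reciprocalRank(result, k), 0);
  return total / results.length;
}
```

Hit Rate only asks whether a relevant document appeared in the top 5, while MRR also rewards placing it near the top, which is why the two numbers can diverge.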

3-round results summary

Round	Change	Chunked Hit Rate @5	Document-level Hit Rate @5	Delta
1st	Baseline comparison	87.9%	97.0%	-9.1 pp
2nd	+ Document-level dedup	90.9%	97.0%	-6.1 pp
3rd	Remove chunking (Document-level confirmed)	n/a	97.0%	0

Chunking never outperformed Document-level in any round. Even with dedup applied in round 2, improvement was limited. The cross-post category (eval-026~028) consistently failed under chunking because splitting fragments context, making it impossible to match broad queries spanning multiple posts.

Final performance comparison (Document-level RAG vs MiniSearch)

Metric	Document-level RAG	MiniSearch
Hit Rate @5	97.0%	75.8%
MRR	64.6%	42.5%
Keyword Coverage	93.2%	94.0%
Avg Latency	0.6ms	3.1ms

Document-level RAG improved Hit Rate @5 by 21.2 percentage points and MRR by 22.1 percentage points over MiniSearch. With only 38 documents indexed (vs 284 chunks), computational overhead was also reduced, improving latency.

Why Document-level won

Corpus fits within embedding context window: With 16 blog posts (8 KO + 8 EN) plus custom docs, each document fits entirely within the embedding model’s context window. Chunking is a technique needed for documents exceeding tens of thousands of words or containing completely unrelated topics within a single document. Neither applies here.
Chunking fragments context: Splitting documents causes partial chunks to occupy top-K slots, missing the actual relevant document for broad queries like “Are there any posts about performance optimization?”
Computational efficiency: 38 documents vs 284 chunks — fewer vectors yielded better results in both indexing and retrieval.

Based on these results, the chunking module was removed and Document-level RAG was confirmed as the final architecture.

Work planned next

After Task 7, priorities are:

Task 8: Formalize hybrid search
- Run keyword and semantic searches in parallel
- Filter by threshold first, then merge with RRF
- Tune weights (0.4 / 0.6)
Task 9: Query cache and versioning
- key strategy {query}:{indexVersion}
- TTL and invalidation improvement
Task 11: Incremental indexing + manifest
- Currently, the sync-rag-index script re-indexes all documents on every run. With only 38 documents, the cost is negligible, so this approach is used for now.
- Goal: define document id as {postId}, store contentHash and lastUpdated per document in a manifest, and only delete/upsert documents whose hash has changed.
- Once applied, re-indexing becomes idempotent — running the script multiple times causes no side effects, updating only the changed documents.
Task 12~13: Failed-document reprocessing and operations
- Retry pipelines and DLQ
- Faster recovery
Task 14~16: Production hardening
- Stable performance metrics (p50/p95), timeout budget
- stronger property/integration testing
- ops runbook and playbook updates
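
The Task 11 goal of hash-based incremental indexing could be planned along these lines. This is a sketch under assumptions: SHA-256 as the hash, a manifest keyed by document id, and the names `Manifest`, `SourceDoc`, `hashContent`, and `planSync` are all hypothetical.

```typescript
// Sketch only: decides which documents need re-embedding or deletion.
import { createHash } from "node:crypto";

interface ManifestEntry {
  contentHash: string;
  lastUpdated: string;
}

type Manifest = Record<string, ManifestEntry>;

interface SourceDoc {
  id: string;
  text: string;
}

function hashContent(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

function planSync(
  docs: SourceDoc[],
  manifest: Manifest
): { toUpsert: string[]; toDelete: string[] } {
  const toUpsert: string[] = [];
  const seen = new Set<string>();
  for (const doc of docs) {
    seen.add(doc.id);
    const previous = manifest[doc.id];
    // New document, or content changed since the last sync.
    if (!previous || previous.contentHash !== hashContent(doc.text)) {
      toUpsert.push(doc.id);
    }
  }
  // Anything in the manifest that no longer exists in the source is stale.
  const toDelete = Object.keys(manifest).filter(id => !seen.has(id));
  return { toUpsert, toDelete };
}
```

Once the manifest is updated after a run, calling `planSync` again with the same documents yields empty lists, which is the idempotence property described for Task 11.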

Closing

MiniSearch + prompt injection was a practical starting point.
But once you need accuracy, citation reliability, operability, and observability, its structural limits become clear quickly.

In my view, this requirements + design set answers a broader question than how to switch models: how to run a reliable search-generation system over time.

If you are considering a similar migration, I recommend writing down the design constraints from a failure perspective before selecting the architecture itself, and then carrying those constraints into tests and implementation.
