<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Chunking on DevOps Way - Практические гайды</title>
    <link>https://devopsway.ru/tags/chunking/</link>
    <description>Recent content in Chunking on DevOps Way - Практические гайды</description>
    <image>
      <title>DevOps Way - Практические гайды</title>
      <url>https://devopsway.ru/images/devopsway-og.png</url>
      <link>https://devopsway.ru/images/devopsway-og.png</link>
    </image>
    <generator>Hugo -- 0.161.1</generator>
    <language>ru</language>
    <lastBuildDate>Wed, 20 May 2026 10:33:03 -0400</lastBuildDate>
    <atom:link href="https://devopsway.ru/tags/chunking/feed.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>RAG Pipeline 3/N: Chunking, or how to split text so the model does not get garbage</title>
      <link>https://devopsway.ru/posts/rag-03-chunking/</link>
      <pubDate>Wed, 20 May 2026 12:00:00 +0300</pubDate>
      <guid>https://devopsway.ru/posts/rag-03-chunking/</guid>
      <description>Why 800 characters and not 1500. Semantic splitting by headings, overlap, sanitization, metadata. Real code and production parameters from a pipeline with 3,010 chunks.</description>
      <content:encoded><![CDATA[<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bloom</td>
          <td>L3–L4 (Apply → Analyze)</td>
      </tr>
      <tr>
          <td>SFIA</td>
          <td>Level 2–3</td>
      </tr>
      <tr>
          <td>Dreyfus</td>
          <td>Advanced Beginner → Competent</td>
      </tr>
      <tr>
          <td>Artifact</td>
          <td>Markdown chunking script + stats</td>
      </tr>
      <tr>
          <td>Verification</td>
          <td>180+ files → 3,010 chunks, 0 Ollama errors</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="tldr">TL;DR</h2>
<p>Splitting text into fragments (chunking) is a stage that affects RAG quality no less than the choice of model. We split on H2/H3 headings, further split oversized sections with a 150-character overlap, and clean out junk. 800 characters is the working ceiling for Russian text under the model's 512-token limit.</p>
<hr>
<h2 id="проблема-модель-видит-не-файл-а-огрызок">The problem: the model sees a scrap, not the file</h2>
<p>In the <a href="/posts/rag-02-embeddings/">previous post</a> we chose an embedding model (mxbai-embed-large) and learned how to turn text into vectors. But the model does not index a whole file: it receives fragments. If a fragment is cut off mid-thought, its vector describes nonsense.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">BAD SPLIT (fixed 1500 characters):
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">File nginx-guide.md:
</span></span><span class="line"><span class="cl">┌──────────────────────────────────────────────────────────┐
</span></span><span class="line"><span class="cl">│  ## Reverse Proxy                                        │
</span></span><span class="line"><span class="cl">│  To configure a reverse proxy, use proxy_pass.           │
</span></span><span class="line"><span class="cl">│  Main parameters:                                        │
</span></span><span class="line"><span class="cl">│  - proxy_set_header Host $host;                          │
</span></span><span class="line"><span class="cl">│  - proxy_set_header X-Real-IP $remote_addr;              │
</span></span><span class="line"><span class="cl">│  ...                                                     │
</span></span><span class="line"><span class="cl">│  ## SSL/TLS                                              │  ← the chunk boundary
</span></span><span class="line"><span class="cl">│  To enable HTTPS, add to the server block: ─────────────►│    falls in the middle
</span></span><span class="line"><span class="cl">│  ssl_certificate /etc/nginx/ssl/cert.pem;                │    of a new topic
</span></span><span class="line"><span class="cl">│  ssl_certificate_key /etc/nginx/ssl/key.pem;             │
</span></span><span class="line"><span class="cl">└──────────────────────────────────────────────────────────┘
</span></span><span class="line"><span class="cl">Chunk 1: Reverse Proxy + start of SSL → the vector describes &#34;something about nginx&#34;
</span></span><span class="line"><span class="cl">Chunk 2: the rest of SSL with no context → &#34;cert.pem and key.pem&#34; with no explanation of what they are for
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">GOOD SPLIT (on H2/H3):
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Chunk 1: [## Reverse Proxy] - the full section on proxy_pass
</span></span><span class="line"><span class="cl">Chunk 2: [## SSL/TLS] - the full section on certificates
</span></span></code></pre></div><p>The first version of my pipeline cut at 1500 characters. Ollama failed with HTTP 500. A binary search showed that in the worst case (Russian markdown with links) 912 characters already fill the model's 512 tokens. 1500 characters is ~840 tokens, roughly 64% over the limit. I reduced the size to 800 and the errors went away.</p>
<p>But size is only half the job. The fragment <code>&quot;…setting up Cad&quot;</code> without its continuation <code>&quot;dy reverse proxy&quot;</code> is useless. We need a splitting strategy.</p>
<hr>
<h2 id="стратегия-два-уровня-нарезки">Strategy: two levels of splitting</h2>
<h3 id="уровень-1-семантическая-нарезка-по-заголовкам">Level 1. Semantic splitting by headings</h3>
<p>Markdown files already carry structure: H2 (<code>##</code>) and H3 (<code>###</code>) headings. Each heading opens a self-contained unit of thought, so we split on them:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">chunk_by_sections</span><span class="p">(</span><span class="n">content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">file_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Split markdown on H2/H3 headings.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">chunks</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">lines</span> <span class="o">=</span> <span class="n">content</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">current_section</span> <span class="o">=</span> <span class="n">file_path</span>  <span class="c1"># file name as the fallback</span>
</span></span><span class="line"><span class="cl">    <span class="n">current_lines</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">heading_match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;^(#{2,3})\s+(.+)&#39;</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">heading_match</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="c1"># A heading closes the previous section (if any) -- save it, then start a new one</span>
</span></span><span class="line"><span class="cl">            <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current_lines</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">text</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">20</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">chunks</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">_split_large_chunk</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">current_section</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">current_lines</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">current_section</span> <span class="o">=</span> <span class="n">heading_match</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">current_lines</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Last section</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">current_lines</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current_lines</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">text</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">20</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">chunks</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">_split_large_chunk</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">current_section</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">chunks</span>
</span></span></code></pre></div><p>Why H2/H3 and not H1? A markdown file usually has a single H1 (<code>#</code>): the document title. The meaningful blocks start at H2 and H3. H4 and deeper are too fine-grained: splitting on them yields fragments of 2-3 lines.</p>
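<p>To see which lines the heading pattern actually treats as section boundaries, here is a small check. It reuses the regular expression from <code>chunk_by_sections</code>; the helper name <code>heading_level</code> and the sample lines are made up for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

# Same pattern as in chunk_by_sections: two or three '#', whitespace, then the title.
HEADING_RE = re.compile(r'^(#{2,3})\s+(.+)')


def heading_level(line):
    """Return 2 or 3 for an H2/H3 heading line, otherwise None."""
    match = HEADING_RE.match(line)
    return len(match.group(1)) if match else None


# Hypothetical sample lines, one per case.
samples = ['# Title', '## Reverse Proxy', '### Headers', '#### Details', '##no-space']
for line in samples:
    print(repr(line), heading_level(line))
</code></pre></div>
<p>Only the H2 and H3 lines return a level; H1, H4, and <code>##no-space</code> stay inside the current section. One caveat: the pattern is applied line by line, so a line such as <code>## comment</code> inside a fenced shell snippet would also count as a heading.</p>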
<h3 id="уровень-2-дорезка-с-перекрытием-overlap">Level 2. Secondary splitting with overlap</h3>
<p>If a section is longer than 800 characters, we split it further into overlapping pieces:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">CHUNK_SIZE</span> <span class="o">=</span> <span class="mi">800</span>   <span class="c1"># characters</span>
</span></span><span class="line"><span class="cl"><span class="n">CHUNK_OVERLAP</span> <span class="o">=</span> <span class="mi">150</span>  <span class="c1"># characters</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">_split_large_chunk</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">section</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Split oversized sections into overlapping pieces.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="n">CHUNK_SIZE</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">[{</span><span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="n">text</span><span class="p">,</span> <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="n">section</span><span class="p">}]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">pieces</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">start</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">end</span> <span class="o">=</span> <span class="n">start</span> <span class="o">+</span> <span class="n">CHUNK_SIZE</span>
</span></span><span class="line"><span class="cl">        <span class="n">piece</span> <span class="o">=</span> <span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">pieces</span><span class="o">.</span><span class="n">append</span><span class="p">({</span>
</span></span><span class="line"><span class="cl">            <span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="n">piece</span><span class="o">.</span><span class="n">strip</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">            <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">section</span><span class="si">}</span><span class="s2"> (part </span><span class="si">{</span><span class="n">idx</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">)&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="p">})</span>
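</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">end</span> <span class="o">&gt;=</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">break</span>  <span class="c1"># text fully covered; another piece would only repeat the overlap</span>
</span></span><span class="line"><span class="cl">        <span class="n">start</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">CHUNK_OVERLAP</span>  <span class="c1"># shift with overlap</span>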
</span></span><span class="line"><span class="cl">        <span class="n">idx</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">pieces</span>
</span></span></code></pre></div><p><strong>Why overlap?</strong> Without it, context is lost at the seam. A sentence that starts at the end of chunk 1 continues at the start of chunk 2, but each chunk is indexed separately. A 150-character overlap duplicates the boundary text in both fragments.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Without overlap:
</span></span><span class="line"><span class="cl">  Chunk 1: [───────────────────]
</span></span><span class="line"><span class="cl">  Chunk 2:                      [───────────────────]
</span></span><span class="line"><span class="cl">  Loss:                         ↑ context is torn
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">With overlap 150:
</span></span><span class="line"><span class="cl">  Chunk 1: [───────────────────]
</span></span><span class="line"><span class="cl">  Chunk 2:              [───────────────────]
</span></span><span class="line"><span class="cl">  Overlap:              ^^^^^^^^ 150 characters are duplicated
</span></span></code></pre></div><p>800/150 is the ratio I arrived at after three iterations. With an overlap below 100, context still tears. Above 200, there is too much duplication and Qdrant bloats.</p>
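<p>The arithmetic behind the overlap can be sketched by listing the window offsets. The function name <code>window_bounds</code> is hypothetical; the sizes are the 800/150 values from this section. The sketch assumes the overlap is smaller than the window size, and it stops as soon as a window reaches the end of the text, so the last piece is never a fragment already covered by the previous one.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">CHUNK_SIZE = 800     # characters per window
CHUNK_OVERLAP = 150  # characters shared by neighbouring windows


def window_bounds(text_len, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return (start, end) offsets of overlapping windows covering text_len characters."""
    step = size - overlap  # with 800/150, each window starts 650 characters after the previous one
    bounds = []
    start = 0
    while True:
        end = min(start + size, text_len)
        bounds.append((start, end))
        if end == text_len:
            break  # this window reached the end of the text
        start += step
    return bounds


print(window_bounds(2000))  # [(0, 800), (650, 1450), (1300, 2000)]
</code></pre></div>
<p>Each pair of neighbouring windows shares exactly 150 characters: 650-800 and 1300-1450 in this example.</p>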
<hr>
<h2 id="почему-800-символов">Why 800 characters</h2>
<p>This is not a magic number. It is the result of a binary search on real data.</p>
<p><strong>Model limit:</strong> mxbai-embed-large accepts at most 512 tokens. When the input exceeds that, Ollama returns HTTP 500 with <code>&quot;context length exceeded&quot;</code>.</p>
<p><strong>Russian text costs more than English:</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">English: &#34;container&#34;     → 1 token
</span></span><span class="line"><span class="cl">Russian: &#34;контейнер&#34;     → ~9 tokens  (Cyrillic is broken into subword fragments)
</span></span><span class="line"><span class="cl">Russian: &#34;проксирование&#34; → ~11 tokens
</span></span></code></pre></div><p>A single Russian word can cost 8-11 tokens. URLs and markdown markup do not compress well either. Worst case: 912 characters of Russian markdown with links = 512 tokens. So 800 characters is ~450 tokens, which leaves headroom.</p>
<p><strong>Binary search:</strong></p>
<table>
  <thead>
      <tr>
          <th>Chunk size</th>
          <th>Tokens (worst-case)</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1500 chars</td>
          <td>~840 tokens</td>
          <td>HTTP 500, Ollama fails</td>
      </tr>
      <tr>
          <td>1200 chars</td>
          <td>~670 tokens</td>
          <td>HTTP 500 on long links</td>
      </tr>
      <tr>
          <td>912 chars</td>
          <td>~512 tokens</td>
          <td>Boundary: barely fits</td>
      </tr>
      <tr>
          <td>800 chars</td>
          <td>~450 tokens</td>
          <td>Stable, with headroom</td>
      </tr>
  </tbody>
</table>
<p>&ldquo;Worst-case Russian markdown with links&rdquo; means text that combines URLs, markdown formatting, and Russian prose. The tokenizer compresses URLs poorly (close to one token per character), so the worst case is worse than plain text.</p>
<p><strong>800 is a compromise:</strong> long enough for a section to carry meaning, short enough to stay under the model's limit.</p>
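<p>The token figures in the table follow from a single measured ratio: 512 tokens per 912 characters in the worst case. Here is a quick sketch of that arithmetic. The function name <code>worst_case_tokens</code> is made up for illustration, and the ratio is an empirical value for this corpus and this model, not a general constant.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">MODEL_TOKEN_LIMIT = 512  # mxbai-embed-large limit, as stated in the text
WORST_CASE_CHARS = 912   # characters that filled the limit in the worst case

TOKENS_PER_CHAR = MODEL_TOKEN_LIMIT / WORST_CASE_CHARS  # about 0.56


def worst_case_tokens(chars):
    """Estimate the worst-case token count for a chunk of the given length."""
    return round(chars * TOKENS_PER_CHAR)


for size in (1500, 1200, 912, 800):
    tokens = worst_case_tokens(size)
    # Negative headroom means the chunk is over the limit.
    print(size, tokens, MODEL_TOKEN_LIMIT - tokens)
</code></pre></div>
<p>The estimates come out as 842, 674, 512, and 449 tokens, which match the rounded values in the table; 800 characters leave about 63 tokens of headroom.</p>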
<hr>
<h2 id="санитизация-чистим-мусор-до-embedding">Sanitization: cleaning junk before embedding</h2>
<p>Dirty text = noisy vector. A noisy vector &ldquo;attracts&rdquo; irrelevant results at search time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">CHUNK_SIZE</span> <span class="o">=</span> <span class="mi">800</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">sanitize_for_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Clean the text before sending it to the model.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 1. Unicode replacement characters (\ufffd)</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#    They appear when files are read with errors=&#39;replace&#39;</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#    Easy to miss by eye, but they break Ollama</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\ufffd</span><span class="s1">&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 2. Control characters (C0/C1)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 3. Re-encoding: drop anything that cannot be encoded as valid UTF-8</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 4. Long runs without whitespace (base64, hashes, URLs)</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#    Truncate to 200 characters -- anything beyond that is useless for search</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\S{200,}&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)[:</span><span class="mi">200</span><span class="p">]</span> <span class="o">+</span> <span class="s1">&#39;...&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 5. Runs of spaces and blank lines</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;[ \t]{10,}&#39;</span><span class="p">,</span> <span class="s1">&#39;  &#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\n{5,}&#39;</span><span class="p">,</span> <span class="s1">&#39;</span><span class="se">\n\n\n</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 6. Truncate to CHUNK_SIZE</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="p">[:</span><span class="n">CHUNK_SIZE</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># 7. Quality gate: too short = junk</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">20</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">text</span>
</span></span></code></pre></div><h3 id="что-именно-чистим-и-почему">What exactly we clean, and why</h3>
<table>
  <thead>
      <tr>
          <th>Junk</th>
          <th>Example</th>
          <th>Problem</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>\ufffd</code></td>
          <td>Corrupted file read with <code>errors='replace'</code></td>
          <td>Easy to miss, but Ollama returns 500</td>
      </tr>
      <tr>
          <td>Control chars</td>
          <td><code>\x00</code>-<code>\x1f</code></td>
          <td>Break the HTTP request to Ollama</td>
      </tr>
      <tr>
          <td>Long strings</td>
          <td>base64 blobs, SHA hashes</td>
          <td>Consume tokens without carrying meaning</td>
      </tr>
      <tr>
          <td>Blank lines</td>
          <td>5+ <code>\n</code> in a row</td>
          <td>Bloat the chunk and crowd out content</td>
      </tr>
  </tbody>
</table>
<p><strong>Quality gate (&lt; 20 characters):</strong> a fragment like &ldquo;## Heading&rdquo; with no body is only 10 characters. It does not carry enough meaning to embed, so we drop it.</p>
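<p>Steps 4 and 5 are the easiest to check in isolation. The sketch below applies the same three regular expressions to a synthetic string; the helper name <code>squeeze</code> and the sample input are invented for the example.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re


def squeeze(text):
    """Apply the long-run and whitespace rules from sanitize_for_embedding."""
    # Truncate runs of 200+ non-whitespace characters (base64, hashes, long URLs).
    text = re.sub(r'\S{200,}', lambda m: m.group(0)[:200] + '...', text)
    # Collapse 10+ spaces or tabs into two spaces.
    text = re.sub(r'[ \t]{10,}', '  ', text)
    # Collapse 5+ newlines into three.
    text = re.sub(r'\n{5,}', '\n\n\n', text)
    return text


# Synthetic input: a 250-character blob, 12 spaces, and 8 newlines.
raw = 'token: ' + 'A' * 250 + ' ' * 12 + 'end' + '\n' * 8 + 'next section'
clean = squeeze(raw)
print(len(raw), len(clean))  # 292 230
</code></pre></div>
<p>The blob shrinks to 200 characters plus <code>...</code>, the 12 spaces become two, and the 8 newlines become three.</p>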
<hr>
<h2 id="метаданные-зачем-чанку-знать-откуда-он">Metadata: why a chunk needs to know where it came from</h2>
<p>Each fragment in Qdrant stores not only a vector but also a payload:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">payload</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;file_path&#39;</span><span class="p">:</span> <span class="s1">&#39;nora-sprint-status.md&#39;</span><span class="p">,</span>     <span class="c1"># source file</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="s1">&#39;v0.9.0 RELEASED (18.05.2026)&#39;</span><span class="p">,</span> <span class="c1"># section heading</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;zone&#39;</span><span class="p">:</span> <span class="s1">&#39;warm&#39;</span><span class="p">,</span>                             <span class="c1"># hot/warm/cold</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;project&#39;</span><span class="p">:</span> <span class="s1">&#39;nora&#39;</span><span class="p">,</span>                          <span class="c1"># project</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;mtime&#39;</span><span class="p">:</span> <span class="mf">1747584000.0</span><span class="p">,</span>                      <span class="c1"># last modified</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;content_hash&#39;</span><span class="p">:</span> <span class="s1">&#39;a1b2c3d4...&#39;</span><span class="p">,</span>              <span class="c1"># md5 для change detection</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;text_preview&#39;</span><span class="p">:</span> <span class="s1">&#39;ARM64 binary + multi...&#39;</span><span class="p">,</span>  <span class="c1"># first 500 characters</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><h3 id="три-зоны-памяти">Three memory zones</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">hot   = MEMORY.md         - the main file, the core of the context
</span></span><span class="line"><span class="cl">warm  = memory/*.md       - active project files
</span></span><span class="line"><span class="cl">cold  = memory/_archive/  - finished projects
</span></span></code></pre></div><p>Zones make it possible to filter search. The query &ldquo;current NORA status&rdquo; searches hot + warm. The query &ldquo;how did we fix bug X in March&rdquo; searches cold. Without zones, the model returns stale context on an equal footing with current context.</p>
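<p>How a file ends up in a zone is not shown in the post. One way to derive it from the layout above is sketched here; the function name <code>zone_for</code>, the sample archive file name, and the fallback to <code>warm</code> are assumptions.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pathlib import PurePosixPath


def zone_for(path):
    """Map a path to 'hot', 'warm', or 'cold' using the layout above."""
    p = PurePosixPath(path)
    if p.name == 'MEMORY.md':
        return 'hot'
    if '_archive' in p.parts:
        return 'cold'
    return 'warm'  # assumption: every other file under memory/ is active


# 'old-project.md' is a hypothetical archived file.
for path in ('MEMORY.md', 'memory/nora-sprint-status.md', 'memory/_archive/old-project.md'):
    print(path, zone_for(path))
</code></pre></div>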
<h3 id="проект-из-имени-файла">Project from the file name</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">PROJECT_PATTERNS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;nora&#39;</span><span class="p">:</span> <span class="sa">r</span><span class="s1">&#39;^nora[-_]&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;pusk&#39;</span><span class="p">:</span> <span class="sa">r</span><span class="s1">&#39;^pusk[-_]&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;kmb&#39;</span><span class="p">:</span> <span class="sa">r</span><span class="s1">&#39;^(kmb[-_]|curriculum[-_]|kokon)&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;jarvis&#39;</span><span class="p">:</span> <span class="sa">r</span><span class="s1">&#39;^(jarvis|JARVIS)&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;infra&#39;</span><span class="p">:</span> <span class="sa">r</span><span class="s1">&#39;^(infrastructure|bastion|pve|vault)&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The file <code>nora-sprint-status.md</code> automatically gets <code>project: 'nora'</code>. This makes it possible to search for &ldquo;circuit breaker&rdquo; only within NORA rather than across all 180 files.</p>
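<p>A sketch of the lookup itself, using only the patterns shown above (the elided entries are left out). The function name <code>detect_project</code> and the <code>"general"</code> default are assumptions made for illustration.</p>

```python
import re

# Subset of the patterns from the article; the real table has more entries.
PROJECT_PATTERNS = {
    'nora': r'^nora[-_]',
    'pusk': r'^pusk[-_]',
    'kmb': r'^(kmb[-_]|curriculum[-_]|kokon)',
    'jarvis': r'^(jarvis|JARVIS)',
    'infra': r'^(infrastructure|bastion|pve|vault)',
}


def detect_project(filename: str, default: str = "general") -> str:
    """Return the first project whose pattern matches the start of the file name."""
    for project, pattern in PROJECT_PATTERNS.items():
        if re.match(pattern, filename):
            return project
    return default  # hypothetical fallback for files that match nothing


print(detect_project("nora-sprint-status.md"))
print(detect_project("vault-setup.md"))
print(detect_project("random-notes.md"))
```

<p>Dict order decides which project wins when several patterns match, so more specific patterns should come first.</p>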
<h3 id="change-detection">Change detection</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">content_hash</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">md5</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">())</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="n">content_hash</span> <span class="ow">in</span> <span class="n">existing_hashes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">skipped_chunks</span> <span class="o">+=</span> <span class="mi">1</span>  <span class="c1"># unchanged, do not re-index</span>
</span></span><span class="line"><span class="cl">    <span class="k">continue</span>
</span></span></code></pre></div><p>The hash is an md5 of the chunk text: if the text has not changed, the chunk is skipped. This gives incremental re-indexing: a systemd timer fires every 10 minutes, but only changed files are actually re-indexed, because chunks with a known hash are skipped. A full pass over 180 files takes seconds.</p>
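<p>The same check as a self-contained sketch. The function names <code>content_hash</code> and <code>select_changed</code> are hypothetical. It assumes the set of already indexed hashes has been loaded from the vector store beforehand.</p>

```python
import hashlib


def content_hash(text: str) -> str:
    """md5 is used here as a cheap change fingerprint, not for security."""
    return hashlib.md5(text.encode()).hexdigest()


def select_changed(chunks: list[str], existing_hashes: set[str]) -> tuple[list[str], int]:
    """Split chunks into those that need embedding and a count of skipped ones."""
    to_index = []
    skipped = 0
    for text in chunks:
        if content_hash(text) in existing_hashes:
            skipped += 1  # already indexed with identical text
            continue
        to_index.append(text)
    return to_index, skipped


existing = {content_hash("chunk A"), content_hash("chunk B")}
to_index, skipped = select_changed(["chunk A", "chunk B", "chunk B, edited"], existing)
print(to_index, skipped)
```

<p>Only the edited chunk goes to the embedding model. The set lookup for the other two costs a hash computation and nothing else.</p>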
<hr>
<h2 id="progressive-truncation-fallback-при-переполнении">Progressive truncation: a fallback for context overflow</h2>
<p>Even with the 800-character limit the model occasionally fails: markdown with links and tables can produce more tokens than expected. The fix is progressive truncation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">retries</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Embedding with progressive truncation.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">sanitize_for_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Try the full text first, then shorter prefixes</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">text_len</span> <span class="ow">in</span> <span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">),</span> <span class="mi">600</span><span class="p">,</span> <span class="mi">400</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="n">prompt</span> <span class="o">=</span> <span class="n">text</span><span class="p">[:</span><span class="n">text_len</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">20</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">retries</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;http://</span><span class="si">{</span><span class="n">OLLAMA_HOST</span><span class="si">}</span><span class="s2">:</span><span class="si">{</span><span class="n">OLLAMA_PORT</span><span class="si">}</span><span class="s2">/api/embeddings&#34;</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="s2">&#34;mxbai-embed-large&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                    <span class="s2">&#34;prompt&#34;</span><span class="p">:</span> <span class="n">prompt</span>
</span></span><span class="line"><span class="cl">                <span class="p">},</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">75</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="k">return</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;embedding&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="k">elif</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">500</span> <span class="ow">and</span> <span class="s2">&#34;context length&#34;</span> <span class="ow">in</span> <span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="k">break</span>  <span class="c1"># truncate and try a shorter prefix</span>
</span></span><span class="line"><span class="cl">            <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">if</span> <span class="n">attempt</span> <span class="o">&lt;</span> <span class="n">retries</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">None</span>
</span></span></code></pre></div><p>The chain is 800 → 600 → 400 characters. Each level retries failed requests with exponential backoff (1 s, then 2 s). In production, progressive truncation triggers on a handful of chunks out of thousands, usually tables with long URLs.</p>
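<p>One detail of the loop above: for a text shorter than 600 characters, <code>text[:600]</code> is the whole text again, so a context-length error leads to the same prompt being sent a second time. A possible refinement, sketched with a hypothetical helper <code>truncation_steps</code>, is to build the list of lengths so that every step is strictly shorter than the previous one:</p>

```python
def truncation_steps(text_len: int,
                     fallbacks: tuple[int, ...] = (600, 400),
                     min_len: int = 20) -> list[int]:
    """Prompt lengths to try, longest first, never repeating an identical prompt."""
    steps = [text_len]
    for limit in fallbacks:
        # A fallback only helps if it actually shortens the prompt.
        if limit < steps[-1]:
            steps.append(limit)
    # Prompts below min_len are not worth embedding at all.
    return [n for n in steps if n >= min_len]


print(truncation_steps(800))
print(truncation_steps(500))
print(truncation_steps(300))
```

<p>The defaults mirror the 600 and 400 fallbacks and the 20-character minimum from the function above. Whether the extra request matters in practice depends on how often the fallback fires.</p>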
<hr>
<h2 id="мини-тест">Mini quiz</h2>
<p><strong>1. A 2000-character file, CHUNK_SIZE=800, CHUNK_OVERLAP=150. How many chunks do you get?</strong></p>
<details>
<summary>Answer</summary>
<p>Four chunks. The stride is CHUNK_SIZE - CHUNK_OVERLAP = 650:</p>
<ul>
<li>Chunk 1: characters 0–800 (800 characters)</li>
<li>Chunk 2: characters 650–1450 (800 characters)</li>
<li>Chunk 3: characters 1300–2000 (700 characters)</li>
<li>Chunk 4: characters 1950–2000 (50 characters, the overlap tail)</li>
</ul>
<p>Why not 3? The code computes the next <code>start = end - CHUNK_OVERLAP</code>, where <code>end = start + CHUNK_SIZE</code>, even when <code>end &gt; len(text)</code>. After chunk 3, <code>start = 2100 - 150 = 1950</code>, which is less than 2000, so the loop runs once more. This is an implementation quirk: the last fragment can be very short.</p>
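<p>The boundaries can be reproduced with a short sketch that mirrors the loop in <code>split_large()</code> but returns only the slice positions. The function name <code>chunk_spans</code> is made up for this example, and the sketch ignores the <code>.strip()</code> call.</p>

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 150


def chunk_spans(text_len: int, size: int = CHUNK_SIZE,
                overlap: int = CHUNK_OVERLAP) -> list[tuple[int, int]]:
    """Return (start, end) slice positions produced by the sliding window."""
    if text_len <= size:
        return [(0, text_len)]
    spans = []
    start = 0
    while start < text_len:
        end = start + size  # may run past the end of the text
        spans.append((start, min(end, text_len)))
        start = end - overlap
    return spans


print(chunk_spans(2000))
# [(0, 800), (650, 1450), (1300, 2000), (1950, 2000)]
```

<p>Leaving the loop as soon as <code>end &gt;= text_len</code> would stop after the third span and avoid the 50-character tail.</p>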
</details>
<p><strong>2. A fragment after sanitization: <code>&quot;## Заголовок&quot;</code> (12 characters). What happens to it?</strong></p>
<details>
<summary>Answer</summary>
<p>It is dropped. The quality gate rejects fragments shorter than 20 characters: they do not carry enough meaning for an embedding. A heading without a body is useless for search.</p>
</details>
<p><strong>3. Overlap = 0 (no overlap). What bug does this cause?</strong></p>
<details>
<summary>Answer</summary>
<p>Sentences on chunk boundaries get torn apart. The fragment <code>&quot;...configuring Cad&quot;</code> loses its continuation <code>&quot;dy reverse proxy&quot;</code>. The vector then describes nonsense, and a search for &ldquo;Caddy reverse proxy&rdquo; will not find this fragment. Overlap duplicates the boundary region in both chunks and so preserves the context.</p>
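<p>A small demonstration of the effect, on a synthetic text where the phrase straddles position 800. The <code>split</code> helper is a simplified stand-in for the real splitter, written only for this example.</p>

```python
def split(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size sliding window; the stride is size - overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


phrase = "Caddy reverse proxy"
# The phrase starts at position 797, so a cut at 800 lands right after "Cad".
text = "x" * 797 + phrase + "y" * 400

no_overlap = split(text, 800, 0)
with_overlap = split(text, 800, 150)

print(no_overlap[0][-6:], "|", no_overlap[1][:16])
print(any(phrase in chunk for chunk in no_overlap))
print(any(phrase in chunk for chunk in with_overlap))
```

<p>Without overlap no chunk contains the whole phrase. With a 150-character overlap the second chunk starts at position 650 and holds it intact.</p>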
</details>
<p><strong>4. Qdrant holds 3,010 chunks. A file changed (1 line). How many chunks get re-indexed?</strong></p>
<details>
<summary>Answer</summary>
<p>Only the chunks from the changed file whose content_hash (md5) changed. If the line was added to a single H2 section, 1-3 chunks are re-indexed (the section plus possibly neighboring pieces produced by re-splitting). The remaining 3,000+ chunks are skipped in milliseconds (hash lookup).</p>
</details>
<hr>
<h2 id="артефакт-скрипт-нарезки-markdown">Artifact: markdown chunking script</h2>
<p>A complete script you can run on your own files. It differs from the production code above in one respect: <code>sanitize()</code> here does not truncate text to CHUNK_SIZE. That is the job of <code>split_large()</code>, because sanitize is called <strong>before</strong> splitting, not after (as <code>sanitize_for_embedding</code> is in production).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="ch">#!/usr/bin/env python3</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">chunk-markdown.py -- split markdown files into chunks for a RAG pipeline.
</span></span></span><span class="line"><span class="cl"><span class="s2">Splits on H2/H3, re-splits large sections with overlap, strips junk.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Usage:
</span></span></span><span class="line"><span class="cl"><span class="s2">  python3 chunk-markdown.py /path/to/docs/
</span></span></span><span class="line"><span class="cl"><span class="s2">  python3 chunk-markdown.py /path/to/docs/ --stats
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">argparse</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">sys</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">CHUNK_SIZE</span> <span class="o">=</span> <span class="mi">800</span>
</span></span><span class="line"><span class="cl"><span class="n">CHUNK_OVERLAP</span> <span class="o">=</span> <span class="mi">150</span>
</span></span><span class="line"><span class="cl"><span class="n">MIN_CHUNK_LEN</span> <span class="o">=</span> <span class="mi">20</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">sanitize</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Strip junk from the text. Does not truncate to CHUNK_SIZE; split_large does that.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\ufffd</span><span class="s1">&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]&#39;</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;ignore&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\S{200,}&#39;</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">m</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)[:</span><span class="mi">200</span><span class="p">]</span> <span class="o">+</span> <span class="s1">&#39;...&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;[ \t]{10,}&#39;</span><span class="p">,</span> <span class="s1">&#39;  &#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\n{5,}&#39;</span><span class="p">,</span> <span class="s1">&#39;</span><span class="se">\n\n\n</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">MIN_CHUNK_LEN</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="s2">&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">text</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">split_large</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">section</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Re-split large sections with overlap.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="n">CHUNK_SIZE</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">[{</span><span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="n">text</span><span class="p">,</span> <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="n">section</span><span class="p">,</span> <span class="s1">&#39;chars&#39;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)}]</span>
</span></span><span class="line"><span class="cl">    <span class="n">pieces</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="k">while</span> <span class="n">start</span> <span class="o">&lt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">end</span> <span class="o">=</span> <span class="n">start</span> <span class="o">+</span> <span class="n">CHUNK_SIZE</span>
</span></span><span class="line"><span class="cl">        <span class="n">piece</span> <span class="o">=</span> <span class="n">text</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">piece</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">pieces</span><span class="o">.</span><span class="n">append</span><span class="p">({</span>
</span></span><span class="line"><span class="cl">                <span class="s1">&#39;text&#39;</span><span class="p">:</span> <span class="n">piece</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s1">&#39;section&#39;</span><span class="p">:</span> <span class="sa">f</span><span class="s2">&#34;</span><span class="si">{</span><span class="n">section</span><span class="si">}</span><span class="s2"> (part </span><span class="si">{</span><span class="n">idx</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s2">)&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                <span class="s1">&#39;chars&#39;</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">piece</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="p">})</span>
</span></span><span class="line"><span class="cl">        <span class="n">start</span> <span class="o">=</span> <span class="n">end</span> <span class="o">-</span> <span class="n">CHUNK_OVERLAP</span>
</span></span><span class="line"><span class="cl">        <span class="n">idx</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">pieces</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">chunk_file</span><span class="p">(</span><span class="n">filepath</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Chunk a single markdown file.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">content</span> <span class="o">=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s1">&#39;utf-8&#39;</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">&#39;replace&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">chunks</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">    <span class="n">lines</span> <span class="o">=</span> <span class="n">content</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">current_section</span> <span class="o">=</span> <span class="n">filepath</span><span class="o">.</span><span class="n">name</span>
</span></span><span class="line"><span class="cl">    <span class="n">current_lines</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">heading</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;^(#{2,3})\s+(.+)&#39;</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">heading</span><span class="p">:</span>  <span class="c1"># also handles a heading on the very first line (empty buffer)</span>
</span></span><span class="line"><span class="cl">            <span class="n">text</span> <span class="o">=</span> <span class="n">sanitize</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current_lines</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">chunks</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">split_large</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">current_section</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">current_lines</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">current_section</span> <span class="o">=</span> <span class="n">heading</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">current_lines</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">current_lines</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="n">sanitize</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">current_lines</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">text</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">chunks</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">split_large</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">current_section</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">chunks</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Chunk markdown files for RAG&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&#34;path&#34;</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">&#34;Directory with .md files&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&#34;--stats&#34;</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s2">&#34;store_true&#34;</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">&#34;Show stats only&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">path</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="ow">not</span> <span class="n">path</span><span class="o">.</span><span class="n">is_dir</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Not a directory: </span><span class="si">{</span><span class="n">path</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">,</span> <span class="n">file</span><span class="o">=</span><span class="n">sys</span><span class="o">.</span><span class="n">stderr</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">total_files</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">total_chunks</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">    <span class="n">total_chars</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">md</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">path</span><span class="o">.</span><span class="n">rglob</span><span class="p">(</span><span class="s1">&#39;*.md&#39;</span><span class="p">)):</span>
</span></span><span class="line"><span class="cl">        <span class="n">chunks</span> <span class="o">=</span> <span class="n">chunk_file</span><span class="p">(</span><span class="n">md</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">total_files</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">        <span class="n">total_chunks</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">stats</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">total_chars</span> <span class="o">+=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="p">[</span><span class="s1">&#39;chars&#39;</span><span class="p">]</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">chunks</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;--- </span><span class="si">{</span><span class="n">md</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s2"> | </span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="s1">&#39;section&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> | </span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="s1">&#39;chars&#39;</span><span class="p">]</span><span class="si">}</span><span class="s2"> chars ---&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="nb">print</span><span class="p">(</span><span class="n">chunk</span><span class="p">[</span><span class="s1">&#39;text&#39;</span><span class="p">][:</span><span class="mi">200</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">chunk</span><span class="p">[</span><span class="s1">&#39;chars&#39;</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">200</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;  ... (</span><span class="si">{</span><span class="n">chunk</span><span class="p">[</span><span class="s1">&#39;chars&#39;</span><span class="p">]</span> <span class="o">-</span> <span class="mi">200</span><span class="si">}</span><span class="s2"> more chars)&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="nb">print</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">stats</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">avg</span> <span class="o">=</span> <span class="n">total_chars</span> <span class="o">/</span> <span class="n">total_chunks</span> <span class="k">if</span> <span class="n">total_chunks</span> <span class="k">else</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Files:  </span><span class="si">{</span><span class="n">total_files</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Chunks: </span><span class="si">{</span><span class="n">total_chunks</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Avg:    </span><span class="si">{</span><span class="n">avg</span><span class="si">:</span><span class="s2">.0f</span><span class="si">}</span><span class="s2"> chars/chunk&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Total:  </span><span class="si">{</span><span class="n">total_chars</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> chars&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">main</span><span class="p">()</span>
</span></span></code></pre></div><p>Usage:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># Inspect the chunking (prints a 200-char preview of each chunk)</span>
</span></span><span class="line"><span class="cl">python3 chunk-markdown.py ~/docs/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Statistics only</span>
</span></span><span class="line"><span class="cl">python3 chunk-markdown.py ~/docs/ --stats
</span></span><span class="line"><span class="cl"><span class="c1"># Files:  &lt;N&gt;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Chunks: &lt;N&gt;</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Avg:    &lt;N&gt; chars/chunk</span>
</span></span><span class="line"><span class="cl"><span class="c1"># Total:  &lt;N&gt; chars</span>
</span></span></code></pre></div><hr>
<h2 id="продакшен-параметры">Production parameters</h2>
<p>Current parameters of our pipeline (3,010 chunks):</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Model</td>
          <td>mxbai-embed-large</td>
          <td>Best quality on Russian text among the models available in Ollama (<a href="/posts/rag-02-embeddings/">post 2/N</a>)</td>
      </tr>
      <tr>
          <td>Dimensions</td>
          <td>1024</td>
          <td>Determined by the model</td>
      </tr>
      <tr>
          <td>CHUNK_SIZE</td>
          <td>800 chars</td>
          <td>Binary search: 912 chars is the worst-case limit for 512 tokens; 800 leaves headroom</td>
      </tr>
      <tr>
          <td>CHUNK_OVERLAP</td>
          <td>150 chars</td>
          <td>Preserves context across chunk boundaries; an overlap &lt; 200 chars does not bloat the index</td>
      </tr>
      <tr>
          <td>Semantic splitting</td>
          <td>H2/H3 headings</td>
          <td>Logical blocks rather than arbitrary cuts</td>
      </tr>
      <tr>
          <td>Sanitization</td>
          <td>\ufffd, control chars, long strings</td>
          <td>Clean vectors = clean search</td>
      </tr>
      <tr>
          <td>Quality gate</td>
          <td>min 20 chars</td>
          <td>Chunks that are too short are noise</td>
      </tr>
      <tr>
          <td>Progressive truncation</td>
          <td>800→600→400</td>
          <td>Fallback on &ldquo;context length exceeded&rdquo;</td>
      </tr>
      <tr>
          <td>Change detection</td>
          <td>md5(content)</td>
          <td>Incremental reindexing</td>
      </tr>
      <tr>
          <td>Sync</td>
          <td>systemd timer, every 10 min</td>
          <td>Re-index only changed files</td>
      </tr>
      <tr>
          <td>Payload</td>
          <td>file, section, zone, project, mtime, hash</td>
          <td>Filtering + change detection</td>
      </tr>
      <tr>
          <td>Volume</td>
          <td>180+ files → 3,010 chunks</td>
          <td>Markdown knowledge base</td>
      </tr>
  </tbody>
</table>
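<p>The change-detection row can be illustrated with a minimal sketch. It assumes the md5 of each file's content was stored in the <code>hash</code> payload field on the previous run; the function names <code>content_hash</code> and <code>files_to_reindex</code> and the in-memory dictionaries are hypothetical stand-ins, not the pipeline's actual code.</p>

```python
import hashlib


def content_hash(text: str) -> str:
    """md5 hex digest of the content; used for change detection, not security."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def files_to_reindex(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return paths that are new or whose content hash differs from the stored one."""
    changed = []
    for path, text in sorted(current.items()):
        if indexed_hashes.get(path) != content_hash(text):
            changed.append(path)
    return changed


# Hypothetical state: hashes saved in the payload during the previous run.
indexed = {
    "a.md": content_hash("# A\nold text"),
    "b.md": content_hash("# B\nsame"),
}
# Current file contents, inlined here to keep the sketch self-contained.
current = {
    "a.md": "# A\nnew text",
    "b.md": "# B\nsame",
    "c.md": "# C\nbrand new",
}

print(files_to_reindex(current, indexed))  # ['a.md', 'c.md']
```

<p>Hashing the content rather than comparing <code>mtime</code> alone means a file that was merely touched is not re-embedded; only real edits and new files reach the embedding model.</p>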
<h3 id="эволюция-параметров">Parameter evolution</h3>
<table>
  <thead>
      <tr>
          <th>Version</th>
          <th>CHUNK_SIZE (chars)</th>
          <th>Model</th>
          <th>Chunks</th>
          <th>Problem</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>v1 (before May 2026)</td>
          <td>1500</td>
          <td>mxbai-embed-large</td>
          <td>2,028</td>
          <td>HTTP 500, ~140 failures/run</td>
      </tr>
      <tr>
          <td>v2 (May 2026)</td>
          <td>800</td>
          <td>mxbai-embed-large</td>
          <td>3,010</td>
          <td>0 failures</td>
      </tr>
  </tbody>
</table>
<p>What changed: CHUNK_SIZE went from 1500 to 800 chars, and sanitization plus progressive truncation were added. Smaller chunks mean more of them (3,010 instead of 2,028), but zero failures.</p>
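<p>The 800→600→400 fallback can be sketched as a retry loop. This is an illustration under stated assumptions, not the pipeline's code: <code>ContextLengthExceeded</code>, <code>embed_with_truncation</code> and <code>fake_embed</code> are hypothetical names, and the stand-in embedder simply rejects inputs longer than 500 characters instead of calling Ollama.</p>

```python
from typing import Callable, Sequence

TRUNCATION_STEPS = (800, 600, 400)  # character limits from the parameter table


class ContextLengthExceeded(Exception):
    """Hypothetical error standing in for a 'context length exceeded' response."""


def embed_with_truncation(
    text: str,
    embed: Callable[[str], Sequence[float]],
    steps: Sequence[int] = TRUNCATION_STEPS,
) -> tuple[Sequence[float], int]:
    """Try progressively shorter prefixes until the embedder accepts one.

    Returns the vector and the character limit that succeeded.
    """
    last_error = None
    for limit in steps:
        try:
            return embed(text[:limit]), limit
        except ContextLengthExceeded as exc:
            last_error = exc  # too long for the model: retry with a shorter prefix
    raise RuntimeError(f"embedding failed even at {steps[-1]} chars") from last_error


def fake_embed(text: str) -> list[float]:
    """Stand-in embedder: rejects inputs over 500 chars, returns a dummy vector."""
    if len(text) > 500:
        raise ContextLengthExceeded(len(text))
    return [float(len(text))]


vector, used_limit = embed_with_truncation("x" * 800, fake_embed)
print(used_limit)  # 400
```

<p>Truncation drops the tail of a chunk, so it works as a last resort behind the 800-character CHUNK_SIZE rather than as a substitute for it.</p>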
<hr>
<h2 id="что-дальше">What's next</h2>
<p>The text is chunked, the vectors are built, and the metadata is in place. But vector-only search does not always find what you need, because cosine similarity can miss exact term matches. Next in the series:</p>
<ul>
<li><strong>RAG Pipeline 4/N: Hybrid search</strong>: BM25 (lexical) + dense vectors (semantic), why both are needed together, Qdrant sparse vectors, ranking</li>
</ul>
<hr>
<p>Telegram: <a href="https://t.me/DevITWay">@DevITWay</a>
Website: <a href="https://devopsway.ru/">devopsway.ru</a></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
