<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Bm25 on DevOps Way - Практические гайды</title>
    <link>https://devopsway.ru/tags/bm25/</link>
    <description>Recent content in Bm25 on DevOps Way - Практические гайды</description>
    <image>
      <title>DevOps Way - Практические гайды</title>
      <url>https://devopsway.ru/images/devopsway-og.png</url>
      <link>https://devopsway.ru/images/devopsway-og.png</link>
    </image>
    <generator>Hugo -- 0.162.0</generator>
    <language>ru</language>
    <lastBuildDate>Wed, 27 May 2026 05:12:46 -0400</lastBuildDate>
    <atom:link href="https://devopsway.ru/tags/bm25/feed.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>RAG Pipeline 4/N: Hybrid Search – Why Vectors Alone Are Not Enough</title>
      <link>https://devopsway.ru/posts/rag-04-hybrid-search/</link>
      <pubDate>Wed, 27 May 2026 06:00:00 +0300</pubDate>
      <guid>https://devopsway.ru/posts/rag-04-hybrid-search/</guid>
      <description>Dense vectors lose exact matches, and BM25 does not understand meaning. Hybrid search combines both methods through Reciprocal Rank Fusion. Real code, a comparison on live data, production parameters.</description>
      <content:encoded><![CDATA[<table>
	<thead>
			<tr>
					<th>Parameter</th>
					<th>Value</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Bloom</td>
					<td>L4 (Analysis)</td>
			</tr>
			<tr>
					<td>SFIA</td>
					<td>Level 3</td>
			</tr>
			<tr>
					<td>Dreyfus</td>
					<td>Competent</td>
			</tr>
			<tr>
					<td>Artifact</td>
					<td>BM25 encoder + RRF fusion script</td>
			</tr>
			<tr>
					<td>Check</td>
					<td>Hybrid search finds results that dense and sparse search each miss on their own</td>
			</tr>
	</tbody>
</table>
<hr>
<h2 id="tldr">TL;DR</h2>
<p>Dense vectors find text that is similar in meaning but lose exact command matches. BM25 finds keywords but does not understand synonyms. Hybrid search runs both in parallel and merges the results with RRF (Reciprocal Rank Fusion). Weights: dense 0.7, sparse 0.3. Total: ~80 ms per query.</p>
<hr>
<h2 id="проблема-dense-search-теряет-точные-совпадения">The problem: dense search loses exact matches</h2>
<p>In the <a href="/posts/rag-03-chunking/">previous post</a> we cut the text into chunks and sent them to Qdrant. Each chunk is a vector of 1024 numbers. Search is cosine similarity between the query vector and the chunk vector.</p>
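<p>To make the scoring step concrete, here is a minimal sketch of cosine similarity on toy 3-dimensional vectors. The <code>cosine_similarity</code> helper, the chunk names and the vectors are illustrative assumptions; in the pipeline the vectors have 1024 dimensions and the comparison happens inside the vector database.</p>

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical toy data: one query vector and three chunk vectors.
query = [1.0, 0.0, 1.0]
chunks = {
    "chunk_a": [1.0, 0.0, 1.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [1.0, 1.0, 0.0],
}

# Rank chunk names by similarity to the query, best first.
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]), reverse=True)
```

<p>Nothing in this score depends on which literal words the chunk contains, only on the direction of its vector. That is why exact command names can get lost.</p>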
<p>The problem: to an embedding model, <code>&quot;настройка&quot;</code> (setup) and <code>&quot;конфигурирование&quot;</code> (configuration) are the same thing. That is good. But <code>&quot;docker compose&quot;</code> and <code>&quot;оркестрация контейнеров&quot;</code> (container orchestration) are also almost the same thing. That is bad when you are looking for a specific command.</p>
<p>A real example from my pipeline. Query: <code>&quot;kubectl apply deployment yaml&quot;</code>.</p>
<p><strong>Dense search</strong> (vectors only):</p>
<table>
	<thead>
			<tr>
					<th>#</th>
					<th>Score</th>
					<th>Result</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>1</td>
					<td>0.816</td>
					<td><code>11.2.1 КТ3: deployment.yaml</code> – just a file name</td>
			</tr>
			<tr>
					<td>2</td>
					<td>0.793</td>
					<td><code>Task 11.1.2: KT6 - Simplify deployment.yaml</code> – also just a file name</td>
			</tr>
			<tr>
					<td>3</td>
					<td>0.774</td>
					<td><code>Error with kustomize – you need to run... kubectl apply directly against deployment.yaml</code></td>
			</tr>
	</tbody>
</table>
<p>Dense found something &ldquo;about deployment.yaml&rdquo;, but the first two results are file listings, not an explanation of how <code>kubectl apply</code> works. The model sees the semantics (&ldquo;deployment&rdquo;, &ldquo;yaml&rdquo;) but does not distinguish a mention from an explanation.</p>
<p><strong>BM25 search</strong> (keywords only):</p>
<table>
	<thead>
			<tr>
					<th>#</th>
					<th>Score</th>
					<th>Result</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>1</td>
					<td>203.1</td>
					<td><code>kubectl apply -f pod.yaml / kubectl apply -f deployment.yaml</code> – a full example with an explanation of the declarative approach</td>
			</tr>
			<tr>
					<td>2</td>
					<td>172.8</td>
					<td><code>kubectl apply directly against deployment.yaml</code></td>
			</tr>
			<tr>
					<td>3</td>
					<td>167.0</td>
					<td><code>kubectl apply -n argocd -f install.yaml</code> – real commands from a lab</td>
			</tr>
	</tbody>
</table>
<p>BM25 found exact matches for <code>kubectl apply</code>. The first result is a full explanation of the move from the imperative to the declarative approach.</p>
<p><strong>Hybrid search</strong> (both + RRF fusion):</p>
<table>
	<thead>
			<tr>
					<th>#</th>
					<th>RRF Score</th>
					<th>Result</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>1</td>
					<td>0.016</td>
					<td><code>Error with kustomize – use kubectl apply directly against deployment.yaml</code></td>
			</tr>
			<tr>
					<td>2</td>
					<td>0.014</td>
					<td><code>Task 9.1.5: YAML Manifests – declarative approach: kubectl apply -f deployment.yaml</code></td>
			</tr>
			<tr>
					<td>3</td>
					<td>0.014</td>
					<td><code>Deployment lifecycle: from kubectl apply to a running pod</code> (ASCII diagram of the API Server)</td>
			</tr>
	</tbody>
</table>
<p>Hybrid surfaced a chunk that contains both the keywords (<code>kubectl apply</code>) and the semantics (an explanation of the lifecycle). Neither dense nor sparse alone returned it in the top 3.</p>
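<p>The fusion step can be sketched roughly as follows. This is a minimal weighted RRF under stated assumptions, not the pipeline's actual code: the <code>rrf_fuse</code> name, the document ids and the constant <code>k = 60</code> (a commonly used RRF default) are illustrative. Only the weights 0.7 and 0.3 come from the TL;DR.</p>

```python
from typing import Dict, List


def rrf_fuse(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[tuple]:
    """Weighted Reciprocal Rank Fusion over several ranked lists of ids.

    Each list contributes weight / (k + rank) for every id it contains,
    so only positions matter, not the raw dense or BM25 scores.
    """
    scores: Dict[str, float] = {}
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ids standing in for the top-3 rows of the dense and BM25 tables.
rankings = {
    "dense": ["file_list_1", "file_list_2", "kustomize_fix"],
    "sparse": ["apply_example", "kustomize_fix", "argocd_install"],
}
fused = rrf_fuse(rankings, weights={"dense": 0.7, "sparse": 0.3})
```

<p>With these toy lists, the only id present in both rankings, <code>kustomize_fix</code>, comes out first even though neither retriever ranked it first. Because RRF works on ranks, the incomparable scales of cosine similarity (0.8) and BM25 (200) never have to be normalized.</p>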
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Dense:   &#34;deployment.yaml&#34; ←── semantic neighbors (file names)
</span></span><span class="line"><span class="cl">Sparse:  &#34;kubectl apply -f&#34; ←── exact command matches
</span></span><span class="line"><span class="cl">Hybrid:  &#34;from kubectl apply to a running pod&#34; ←── both the commands and the meaning
</span></span></code></pre></div><hr>
<h2 id="как-работает-bm25-для-русского-текста">How BM25 works for Russian text</h2>
<p>BM25 is a ranking formula from 1994. It measures how important each query word is for each document. The core idea: a word that occurs often in one document but rarely in the corpus is important.</p>
<h3 id="формула">Formula</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">score(q, d) = Σ IDF(t) × TF(t, d) × (k1 + 1) / (TF(t, d) + k1 × (1 - b + b × |d| / avg_dl))
</span></span></code></pre></div><p>Where:</p>
<ul>
<li><strong>IDF(t)</strong> – inverse document frequency: <code>log((N - df + 0.5) / (df + 0.5) + 1)</code>, where <code>N</code> is the number of documents and <code>df</code> is the number of documents containing <code>t</code>. A word found in 2 documents out of 200 000 is worth more than a word found in 50 000.</li>
<li><strong>TF(t, d)</strong> – how many times the term <code>t</code> occurs in the document <code>d</code>.</li>
<li><strong>k1 = 1.5</strong> – TF saturation. With k1=0 the number of repetitions does not matter. With k1=2 each repetition still adds weight.</li>
<li><strong>b = 0.75</strong> – length normalization. With b=1 long documents are penalized heavily. With b=0 length does not matter.</li>
<li><strong>|d| / avg_dl</strong> – the ratio of the document's length to the average length in the corpus.</li>
</ul>
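<p>A small worked example may help to see how these parameters interact. The sketch below plugs the numbers from the list into the formula for a single term. The function names and the document length of 100 tokens are assumptions made for illustration.</p>

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF variant from the list above: log((N - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1)


def bm25_term_score(tf, doc_len, avg_dl, idf, k1=1.5, b=0.75) -> float:
    """Contribution of one query term to score(q, d)."""
    length_norm = 1 - b + b * doc_len / avg_dl
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# A term found in 2 of 200 000 documents vs. one found in 50 000 of them.
rare = bm25_idf(200_000, 2)
common = bm25_idf(200_000, 50_000)

# Same rare term, document of average length, 1 vs. 5 occurrences.
once = bm25_term_score(tf=1, doc_len=100, avg_dl=100, idf=rare)
five = bm25_term_score(tf=5, doc_len=100, avg_dl=100, idf=rare)
```

<p>Here <code>rare</code> is roughly eight times larger than <code>common</code>, and five occurrences score only about 1.92 times as much as one. That is the saturation controlled by <code>k1</code>: the multiplier can never exceed <code>k1 + 1 = 2.5</code>, however often the term repeats.</p>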
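<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SimpleBM25</span><span class="p">:</span>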
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">k1</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.5</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.75</span><span class="p">,</span> <span class="n">use_stemming</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">True</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">k1</span> <span class="o">=</span> <span class="n">k1</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">b</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">use_stemming</span> <span class="o">=</span> <span class="n">use_stemming</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">idf</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">avg_dl</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">List</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;&#34;&#34;Encode text -&gt; sparse vector for Qdrant.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">tf</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">doc_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="n">indices</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        <span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">term</span><span class="p">,</span> <span class="n">freq</span> <span class="ow">in</span> <span class="n">tf</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">term</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">continue</span>
</span></span><span class="line"><span class="cl">            <span class="n">idx</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">[</span><span class="n">term</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">idf</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">idf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">term</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">            <span class="n">numerator</span> <span class="o">=</span> <span class="n">freq</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">k1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">denominator</span> <span class="o">=</span> <span class="n">freq</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">k1</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">*</span> <span class="n">doc_len</span> <span class="o">/</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">avg_dl</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">score</span> <span class="o">=</span> <span class="n">idf</span> <span class="o">*</span> <span class="p">(</span><span class="n">numerator</span> <span class="o">/</span> <span class="n">denominator</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">score</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">indices</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">{</span><span class="s2">&#34;indices&#34;</span><span class="p">:</span> <span class="n">indices</span><span class="p">,</span> <span class="s2">&#34;values&#34;</span><span class="p">:</span> <span class="n">values</span><span class="p">}</span>
</span></span></code></pre></div><p>The result of <code>encode()</code> is a sparse vector: an array of <code>(index, value)</code> pairs. Qdrant stores them compactly, keeping only the non-zero elements. A typical 4-word query produces a vector with 4 non-zero positions out of a vocabulary of 125 000+ terms.</p>
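<p>The class above reads <code>vocab</code>, <code>idf</code> and <code>avg_dl</code>, but the code that fills them is not shown. One way such a fitting step could look is sketched below. The <code>fit_bm25_stats</code> function is hypothetical, and its plain <code>\w+</code> tokenizer is only a stand-in for the <code>_tokenize</code> method, which is not shown either.</p>

```python
import math
import re
from collections import Counter
from typing import Dict, List


def fit_bm25_stats(corpus: List[str]) -> Dict[str, object]:
    """Build vocab, IDF and average document length from a corpus."""
    # Stand-in tokenizer: lowercase word characters only (an assumption).
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]
    doc_count = len(tokenized)
    avg_dl = sum(len(tokens) for tokens in tokenized) / max(doc_count, 1)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Stable term -> index mapping used as sparse vector indices.
    vocab = {term: idx for idx, term in enumerate(sorted(df))}
    idf = {
        term: math.log((doc_count - freq + 0.5) / (freq + 0.5) + 1)
        for term, freq in df.items()
    }
    return {"vocab": vocab, "idf": idf, "avg_dl": avg_dl, "doc_count": doc_count}


stats = fit_bm25_stats([
    "kubectl apply -f deployment.yaml",
    "docker run nginx",
    "kubectl get pods",
])
```

<p>In this toy corpus <code>kubectl</code> occurs in two of the three documents and <code>docker</code> in one, so <code>docker</code> gets the higher IDF. The vocabulary has to be fitted before any call to <code>encode()</code>, because terms missing from <code>vocab</code> are skipped there.</p>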
<h3 id="проблема-1-русская-морфология">Problem #1: Russian morphology</h3>
<p>Without morphology, BM25 works significantly worse on Russian. &ldquo;Настройка&rdquo;, &ldquo;настройки&rdquo;, &ldquo;настройку&rdquo;, &ldquo;настроить&rdquo; (forms of &ldquo;setup&rdquo; and &ldquo;to set up&rdquo;) are four different words to a naive BM25. For the query &ldquo;настройка docker&rdquo;, the first word cannot match anything in a document that says &ldquo;настроить docker&rdquo;.</p>
<p>The fix: lemmatization with pymorphy3.</p>
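<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">pymorphy3</span>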
</span></span><span class="line"><span class="cl"><span class="n">MORPH</span> <span class="o">=</span> <span class="n">pymorphy3</span><span class="o">.</span><span class="n">MorphAnalyzer</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">_stem</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">word</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Reduce a word to its normal form.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;^[а-яё]+$&#39;</span><span class="p">,</span> <span class="n">word</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">parsed</span> <span class="o">=</span> <span class="n">MORPH</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">word</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">parsed</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">parsed</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">normal_form</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">word</span>
</span></span></code></pre></div><p>Result:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">настройка  → настройка
</span></span><span class="line"><span class="cl">настройки  → настройка
</span></span><span class="line"><span class="cl">настройку  → настройка
</span></span><span class="line"><span class="cl">настроить  → настроить   # different part of speech: the lemma stays separate, so a match has to come from the other query terms
</span></span></code></pre></div><p>Only Cyrillic tokens are passed to pymorphy3. Latin-script tokens (docker, kubectl, nginx) pass through unchanged, since technical terms are not inflected anyway.</p>
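<p>The routing of tokens by script can be sketched without pymorphy3 installed. In the sketch below the lemmatizer is passed in as a function; <code>tokenize</code>, <code>TOY_LEMMAS</code> and <code>toy_lemmatize</code> are hypothetical stand-ins, and in the pipeline the callable would wrap <code>MORPH.parse(word)[0].normal_form</code> as shown above.</p>

```python
import re
from typing import Callable, List

# Same pattern as in _stem: lowercase Cyrillic letters only.
CYRILLIC_WORD = re.compile(r"^[а-яё]+$")


def tokenize(text: str, lemmatize: Callable[[str], str]) -> List[str]:
    """Lowercase, split into word tokens, lemmatize only Cyrillic tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [lemmatize(t) if CYRILLIC_WORD.match(t) else t for t in tokens]


# Toy lookup table standing in for a real morphological analyzer (an assumption).
TOY_LEMMAS = {"настройки": "настройка", "настройку": "настройка"}


def toy_lemmatize(word: str) -> str:
    return TOY_LEMMAS.get(word, word)


tokens = tokenize("Настройки Docker и kubectl", toy_lemmatize)
```

<p>After this step <code>tokens</code> holds <code>настройка</code>, <code>docker</code>, <code>и</code>, <code>kubectl</code>: the Russian word is normalized, the Latin-script terms are untouched, and the leftover <code>и</code> is a job for the stop-word filter.</p>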
<h3 id="проблема-2-стоп-слова-vs-защищённые-токены">Problem #2: stop-words vs protected tokens</h3>
<p>Stop-words are high-frequency words that are useless for search. &ldquo;На&rdquo;, &ldquo;в&rdquo;, &ldquo;с&rdquo;, &ldquo;это&rdquo;, &ldquo;для&rdquo; (&ldquo;on&rdquo;, &ldquo;in&rdquo;, &ldquo;with&rdquo;, &ldquo;this&rdquo;, &ldquo;for&rdquo;) occur in almost every document, so their IDF is close to zero.</p>
<p>But in a DevOps context, many &ldquo;ordinary&rdquo; words are technical commands:</p>
<table>
	<thead>
			<tr>
					<th>Word</th>
					<th>Everyday language</th>
					<th>DevOps context</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>get</code></td>
					<td>&ldquo;to receive&rdquo;</td>
					<td><code>kubectl get pods</code></td>
			</tr>
			<tr>
					<td><code>set</code></td>
					<td>&ldquo;to put in place&rdquo;</td>
					<td><code>redis SET key value</code></td>
			</tr>
			<tr>
					<td><code>run</code></td>
					<td>&ldquo;to move fast on foot&rdquo;</td>
					<td><code>docker run nginx</code></td>
			</tr>
			<tr>
					<td><code>from</code></td>
					<td>&ldquo;out of, starting at&rdquo;</td>
					<td><code>FROM ubuntu:22.04</code></td>
			</tr>
			<tr>
					<td><code>if</code></td>
					<td>&ldquo;on condition that&rdquo;</td>
					<td><code>if [ -f /etc/nginx.conf ]</code></td>
			</tr>
			<tr>
					<td><code>in</code></td>
					<td>&ldquo;inside&rdquo;</td>
					<td><code>for pod in $(kubectl get pods)</code></td>
			</tr>
			<tr>
					<td><code>on</code></td>
					<td>&ldquo;on top of&rdquo;</td>
					<td><code>ON DELETE CASCADE</code></td>
			</tr>
			<tr>
					<td><code>as</code></td>
					<td>&ldquo;in the role of&rdquo;</td>
					<td><code>import pandas as pd</code></td>
			</tr>
	</tbody>
</table>
<p>A naive BM25 setup throws out <code>get</code>, <code>set</code>, <code>run</code>, <code>from</code> as stop-words. Yet these are key commands.</p>
<p>The fix: three stop-word lists and one protection list.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># General Russian stop-words (NOT from NLTK, hand-picked)</span>
</span></span><span class="line"><span class="cl"><span class="n">RUSSIAN_STOPWORDS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Prepositions</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;на&#39;</span><span class="p">,</span> <span class="s1">&#39;в&#39;</span><span class="p">,</span> <span class="s1">&#39;во&#39;</span><span class="p">,</span> <span class="s1">&#39;с&#39;</span><span class="p">,</span> <span class="s1">&#39;со&#39;</span><span class="p">,</span> <span class="s1">&#39;к&#39;</span><span class="p">,</span> <span class="s1">&#39;ко&#39;</span><span class="p">,</span> <span class="s1">&#39;о&#39;</span><span class="p">,</span> <span class="s1">&#39;об&#39;</span><span class="p">,</span> <span class="s1">&#39;по&#39;</span><span class="p">,</span> <span class="s1">&#39;за&#39;</span><span class="p">,</span> <span class="s1">&#39;из&#39;</span><span class="p">,</span> <span class="s1">&#39;от&#39;</span><span class="p">,</span> <span class="s1">&#39;до&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;у&#39;</span><span class="p">,</span> <span class="s1">&#39;при&#39;</span><span class="p">,</span> <span class="s1">&#39;для&#39;</span><span class="p">,</span> <span class="s1">&#39;без&#39;</span><span class="p">,</span> <span class="s1">&#39;под&#39;</span><span class="p">,</span> <span class="s1">&#39;над&#39;</span><span class="p">,</span> <span class="s1">&#39;через&#39;</span><span class="p">,</span> <span class="s1">&#39;между&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Conjunctions</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;и&#39;</span><span class="p">,</span> <span class="s1">&#39;а&#39;</span><span class="p">,</span> <span class="s1">&#39;но&#39;</span><span class="p">,</span> <span class="s1">&#39;или&#39;</span><span class="p">,</span> <span class="s1">&#39;да&#39;</span><span class="p">,</span> <span class="s1">&#39;же&#39;</span><span class="p">,</span> <span class="s1">&#39;ли&#39;</span><span class="p">,</span> <span class="s1">&#39;ни&#39;</span><span class="p">,</span> <span class="s1">&#39;что&#39;</span><span class="p">,</span> <span class="s1">&#39;как&#39;</span><span class="p">,</span> <span class="s1">&#39;если&#39;</span><span class="p">,</span> <span class="s1">&#39;когда&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;чтобы&#39;</span><span class="p">,</span> <span class="s1">&#39;потому&#39;</span><span class="p">,</span> <span class="s1">&#39;поэтому&#39;</span><span class="p">,</span> <span class="s1">&#39;так&#39;</span><span class="p">,</span> <span class="s1">&#39;тоже&#39;</span><span class="p">,</span> <span class="s1">&#39;также&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Pronouns</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;я&#39;</span><span class="p">,</span> <span class="s1">&#39;ты&#39;</span><span class="p">,</span> <span class="s1">&#39;он&#39;</span><span class="p">,</span> <span class="s1">&#39;она&#39;</span><span class="p">,</span> <span class="s1">&#39;оно&#39;</span><span class="p">,</span> <span class="s1">&#39;мы&#39;</span><span class="p">,</span> <span class="s1">&#39;вы&#39;</span><span class="p">,</span> <span class="s1">&#39;они&#39;</span><span class="p">,</span> <span class="s1">&#39;это&#39;</span><span class="p">,</span> <span class="s1">&#39;то&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;мой&#39;</span><span class="p">,</span> <span class="s1">&#39;твой&#39;</span><span class="p">,</span> <span class="s1">&#39;его&#39;</span><span class="p">,</span> <span class="s1">&#39;её&#39;</span><span class="p">,</span> <span class="s1">&#39;их&#39;</span><span class="p">,</span> <span class="s1">&#39;наш&#39;</span><span class="p">,</span> <span class="s1">&#39;ваш&#39;</span><span class="p">,</span> <span class="s1">&#39;свой&#39;</span><span class="p">,</span> <span class="s1">&#39;кто&#39;</span><span class="p">,</span> <span class="s1">&#39;сам&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Particles, adverbs, copulas</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;не&#39;</span><span class="p">,</span> <span class="s1">&#39;бы&#39;</span><span class="p">,</span> <span class="s1">&#39;вот&#39;</span><span class="p">,</span> <span class="s1">&#39;уже&#39;</span><span class="p">,</span> <span class="s1">&#39;ещё&#39;</span><span class="p">,</span> <span class="s1">&#39;только&#39;</span><span class="p">,</span> <span class="s1">&#39;очень&#39;</span><span class="p">,</span> <span class="s1">&#39;там&#39;</span><span class="p">,</span> <span class="s1">&#39;тут&#39;</span><span class="p">,</span> <span class="s1">&#39;где&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;быть&#39;</span><span class="p">,</span> <span class="s1">&#39;есть&#39;</span><span class="p">,</span> <span class="s1">&#39;был&#39;</span><span class="p">,</span> <span class="s1">&#39;была&#39;</span><span class="p">,</span> <span class="s1">&#39;было&#39;</span><span class="p">,</span> <span class="s1">&#39;были&#39;</span><span class="p">,</span> <span class="s1">&#39;будет&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;можно&#39;</span><span class="p">,</span> <span class="s1">&#39;нужно&#39;</span><span class="p">,</span> <span class="s1">&#39;надо&#39;</span><span class="p">,</span> <span class="s1">&#39;нельзя&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...about 20 more words (parentheticals, adverbs)</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Stop-words for chats with an LLM assistant</span>
</span></span><span class="line"><span class="cl"><span class="n">CHAT_STOPWORDS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Greetings / politeness</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;привет&#39;</span><span class="p">,</span> <span class="s1">&#39;здравствуйте&#39;</span><span class="p">,</span> <span class="s1">&#39;пожалуйста&#39;</span><span class="p">,</span> <span class="s1">&#39;спасибо&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Requests to the assistant</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;подскажи&#39;</span><span class="p">,</span> <span class="s1">&#39;помоги&#39;</span><span class="p">,</span> <span class="s1">&#39;объясни&#39;</span><span class="p">,</span> <span class="s1">&#39;расскажи&#39;</span><span class="p">,</span> <span class="s1">&#39;покажи&#39;</span><span class="p">,</span> <span class="s1">&#39;сделай&#39;</span><span class="p">,</span> <span class="s1">&#39;напиши&#39;</span><span class="p">,</span> <span class="s1">&#39;создай&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Reactions</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;отлично&#39;</span><span class="p">,</span> <span class="s1">&#39;хорошо&#39;</span><span class="p">,</span> <span class="s1">&#39;понял&#39;</span><span class="p">,</span> <span class="s1">&#39;понятно&#39;</span><span class="p">,</span> <span class="s1">&#39;готово&#39;</span><span class="p">,</span> <span class="s1">&#39;сделано&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Filler words</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;давай&#39;</span><span class="p">,</span> <span class="s1">&#39;ладно&#39;</span><span class="p">,</span> <span class="s1">&#39;окей&#39;</span><span class="p">,</span> <span class="s1">&#39;типа&#39;</span><span class="p">,</span> <span class="s1">&#39;короче&#39;</span><span class="p">,</span> <span class="s1">&#39;вообще&#39;</span><span class="p">,</span> <span class="s1">&#39;просто&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># General English stop-words</span>
</span></span><span class="line"><span class="cl"><span class="n">ENGLISH_STOPWORDS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;the&#39;</span><span class="p">,</span> <span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;an&#39;</span><span class="p">,</span> <span class="s1">&#39;is&#39;</span><span class="p">,</span> <span class="s1">&#39;are&#39;</span><span class="p">,</span> <span class="s1">&#39;was&#39;</span><span class="p">,</span> <span class="s1">&#39;were&#39;</span><span class="p">,</span> <span class="s1">&#39;be&#39;</span><span class="p">,</span> <span class="s1">&#39;been&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;have&#39;</span><span class="p">,</span> <span class="s1">&#39;has&#39;</span><span class="p">,</span> <span class="s1">&#39;had&#39;</span><span class="p">,</span> <span class="s1">&#39;do&#39;</span><span class="p">,</span> <span class="s1">&#39;does&#39;</span><span class="p">,</span> <span class="s1">&#39;did&#39;</span><span class="p">,</span> <span class="s1">&#39;will&#39;</span><span class="p">,</span> <span class="s1">&#39;would&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;this&#39;</span><span class="p">,</span> <span class="s1">&#39;that&#39;</span><span class="p">,</span> <span class="s1">&#39;it&#39;</span><span class="p">,</span> <span class="s1">&#39;its&#39;</span><span class="p">,</span> <span class="s1">&#39;and&#39;</span><span class="p">,</span> <span class="s1">&#39;or&#39;</span><span class="p">,</span> <span class="s1">&#39;but&#39;</span><span class="p">,</span> <span class="s1">&#39;if&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;of&#39;</span><span class="p">,</span> <span class="s1">&#39;at&#39;</span><span class="p">,</span> <span class="s1">&#39;by&#39;</span><span class="p">,</span> <span class="s1">&#39;for&#39;</span><span class="p">,</span> <span class="s1">&#39;with&#39;</span><span class="p">,</span> <span class="s1">&#39;about&#39;</span><span class="p">,</span> <span class="s1">&#39;into&#39;</span><span class="p">,</span> <span class="s1">&#39;from&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;no&#39;</span><span class="p">,</span> <span class="s1">&#39;not&#39;</span><span class="p">,</span> <span class="s1">&#39;only&#39;</span><span class="p">,</span> <span class="s1">&#39;so&#39;</span><span class="p">,</span> <span class="s1">&#39;than&#39;</span><span class="p">,</span> <span class="s1">&#39;too&#39;</span><span class="p">,</span> <span class="s1">&#39;very&#39;</span><span class="p">,</span> <span class="s1">&#39;just&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># ...~30 more words</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>The three lists cover three sources of noise: Russian grammar, chit-chat with the assistant, and English articles and prepositions. Without protection, though, they would wipe out half of the useful terms.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># PROTECTED tokens: do NOT remove, even if they are in a stop list!</span>
</span></span><span class="line"><span class="cl"><span class="n">PROTECTED_TOKENS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Bash commands</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;cd&#39;</span><span class="p">,</span> <span class="s1">&#39;ls&#39;</span><span class="p">,</span> <span class="s1">&#39;rm&#39;</span><span class="p">,</span> <span class="s1">&#39;cp&#39;</span><span class="p">,</span> <span class="s1">&#39;mv&#39;</span><span class="p">,</span> <span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="s1">&#39;grep&#39;</span><span class="p">,</span> <span class="s1">&#39;sed&#39;</span><span class="p">,</span> <span class="s1">&#39;awk&#39;</span><span class="p">,</span> <span class="s1">&#39;find&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;head&#39;</span><span class="p">,</span> <span class="s1">&#39;tail&#39;</span><span class="p">,</span> <span class="s1">&#39;echo&#39;</span><span class="p">,</span> <span class="s1">&#39;touch&#39;</span><span class="p">,</span> <span class="s1">&#39;mkdir&#39;</span><span class="p">,</span> <span class="s1">&#39;chmod&#39;</span><span class="p">,</span> <span class="s1">&#39;chown&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;curl&#39;</span><span class="p">,</span> <span class="s1">&#39;wget&#39;</span><span class="p">,</span> <span class="s1">&#39;ssh&#39;</span><span class="p">,</span> <span class="s1">&#39;scp&#39;</span><span class="p">,</span> <span class="s1">&#39;rsync&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;ps&#39;</span><span class="p">,</span> <span class="s1">&#39;kill&#39;</span><span class="p">,</span> <span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;df&#39;</span><span class="p">,</span> <span class="s1">&#39;du&#39;</span><span class="p">,</span> <span class="s1">&#39;free&#39;</span><span class="p">,</span> <span class="s1">&#39;tar&#39;</span><span class="p">,</span> <span class="s1">&#39;zip&#39;</span><span class="p">,</span> <span class="s1">&#39;unzip&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Docker</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;up&#39;</span><span class="p">,</span> <span class="s1">&#39;down&#39;</span><span class="p">,</span> <span class="s1">&#39;run&#39;</span><span class="p">,</span> <span class="s1">&#39;exec&#39;</span><span class="p">,</span> <span class="s1">&#39;build&#39;</span><span class="p">,</span> <span class="s1">&#39;push&#39;</span><span class="p">,</span> <span class="s1">&#39;pull&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;stop&#39;</span><span class="p">,</span> <span class="s1">&#39;start&#39;</span><span class="p">,</span> <span class="s1">&#39;restart&#39;</span><span class="p">,</span> <span class="s1">&#39;logs&#39;</span><span class="p">,</span> <span class="s1">&#39;images&#39;</span><span class="p">,</span> <span class="s1">&#39;prune&#39;</span><span class="p">,</span> <span class="s1">&#39;compose&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Git</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;add&#39;</span><span class="p">,</span> <span class="s1">&#39;commit&#39;</span><span class="p">,</span> <span class="s1">&#39;push&#39;</span><span class="p">,</span> <span class="s1">&#39;pull&#39;</span><span class="p">,</span> <span class="s1">&#39;fetch&#39;</span><span class="p">,</span> <span class="s1">&#39;merge&#39;</span><span class="p">,</span> <span class="s1">&#39;rebase&#39;</span><span class="p">,</span> <span class="s1">&#39;checkout&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;clone&#39;</span><span class="p">,</span> <span class="s1">&#39;init&#39;</span><span class="p">,</span> <span class="s1">&#39;status&#39;</span><span class="p">,</span> <span class="s1">&#39;diff&#39;</span><span class="p">,</span> <span class="s1">&#39;log&#39;</span><span class="p">,</span> <span class="s1">&#39;reset&#39;</span><span class="p">,</span> <span class="s1">&#39;stash&#39;</span><span class="p">,</span> <span class="s1">&#39;tag&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Kubernetes</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;get&#39;</span><span class="p">,</span> <span class="s1">&#39;set&#39;</span><span class="p">,</span> <span class="s1">&#39;apply&#39;</span><span class="p">,</span> <span class="s1">&#39;delete&#39;</span><span class="p">,</span> <span class="s1">&#39;describe&#39;</span><span class="p">,</span> <span class="s1">&#39;logs&#39;</span><span class="p">,</span> <span class="s1">&#39;exec&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;scale&#39;</span><span class="p">,</span> <span class="s1">&#39;rollout&#39;</span><span class="p">,</span> <span class="s1">&#39;expose&#39;</span><span class="p">,</span> <span class="s1">&#39;create&#39;</span><span class="p">,</span> <span class="s1">&#39;edit&#39;</span><span class="p">,</span> <span class="s1">&#39;patch&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Programming (code search does not work without these)</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;if&#39;</span><span class="p">,</span> <span class="s1">&#39;else&#39;</span><span class="p">,</span> <span class="s1">&#39;for&#39;</span><span class="p">,</span> <span class="s1">&#39;while&#39;</span><span class="p">,</span> <span class="s1">&#39;do&#39;</span><span class="p">,</span> <span class="s1">&#39;case&#39;</span><span class="p">,</span> <span class="s1">&#39;try&#39;</span><span class="p">,</span> <span class="s1">&#39;catch&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;return&#39;</span><span class="p">,</span> <span class="s1">&#39;break&#39;</span><span class="p">,</span> <span class="s1">&#39;continue&#39;</span><span class="p">,</span> <span class="s1">&#39;pass&#39;</span><span class="p">,</span> <span class="s1">&#39;raise&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;import&#39;</span><span class="p">,</span> <span class="s1">&#39;from&#39;</span><span class="p">,</span> <span class="s1">&#39;as&#39;</span><span class="p">,</span> <span class="s1">&#39;with&#39;</span><span class="p">,</span> <span class="s1">&#39;in&#39;</span><span class="p">,</span> <span class="s1">&#39;on&#39;</span><span class="p">,</span> <span class="s1">&#39;not&#39;</span><span class="p">,</span> <span class="s1">&#39;is&#39;</span><span class="p">,</span> <span class="s1">&#39;or&#39;</span><span class="p">,</span> <span class="s1">&#39;and&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;def&#39;</span><span class="p">,</span> <span class="s1">&#39;class&#39;</span><span class="p">,</span> <span class="s1">&#39;fn&#39;</span><span class="p">,</span> <span class="s1">&#39;func&#39;</span><span class="p">,</span> <span class="s1">&#39;function&#39;</span><span class="p">,</span> <span class="s1">&#39;let&#39;</span><span class="p">,</span> <span class="s1">&#39;const&#39;</span><span class="p">,</span> <span class="s1">&#39;var&#39;</span><span class="p">,</span> <span class="s1">&#39;mut&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;true&#39;</span><span class="p">,</span> <span class="s1">&#39;false&#39;</span><span class="p">,</span> <span class="s1">&#39;null&#39;</span><span class="p">,</span> <span class="s1">&#39;none&#39;</span><span class="p">,</span> <span class="s1">&#39;nil&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># HTTP</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;get&#39;</span><span class="p">,</span> <span class="s1">&#39;post&#39;</span><span class="p">,</span> <span class="s1">&#39;put&#39;</span><span class="p">,</span> <span class="s1">&#39;patch&#39;</span><span class="p">,</span> <span class="s1">&#39;delete&#39;</span><span class="p">,</span> <span class="s1">&#39;head&#39;</span><span class="p">,</span> <span class="s1">&#39;options&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># API/Protocols</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;api&#39;</span><span class="p">,</span> <span class="s1">&#39;http&#39;</span><span class="p">,</span> <span class="s1">&#39;https&#39;</span><span class="p">,</span> <span class="s1">&#39;tcp&#39;</span><span class="p">,</span> <span class="s1">&#39;udp&#39;</span><span class="p">,</span> <span class="s1">&#39;ssh&#39;</span><span class="p">,</span> <span class="s1">&#39;ftp&#39;</span><span class="p">,</span> <span class="s1">&#39;dns&#39;</span><span class="p">,</span> <span class="s1">&#39;ssl&#39;</span><span class="p">,</span> <span class="s1">&#39;tls&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># SQL</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;select&#39;</span><span class="p">,</span> <span class="s1">&#39;insert&#39;</span><span class="p">,</span> <span class="s1">&#39;update&#39;</span><span class="p">,</span> <span class="s1">&#39;delete&#39;</span><span class="p">,</span> <span class="s1">&#39;from&#39;</span><span class="p">,</span> <span class="s1">&#39;where&#39;</span><span class="p">,</span> <span class="s1">&#39;join&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;order&#39;</span><span class="p">,</span> <span class="s1">&#39;by&#39;</span><span class="p">,</span> <span class="s1">&#39;group&#39;</span><span class="p">,</span> <span class="s1">&#39;having&#39;</span><span class="p">,</span> <span class="s1">&#39;limit&#39;</span><span class="p">,</span> <span class="s1">&#39;offset&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># DevOps tools</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;docker&#39;</span><span class="p">,</span> <span class="s1">&#39;kubernetes&#39;</span><span class="p">,</span> <span class="s1">&#39;k8s&#39;</span><span class="p">,</span> <span class="s1">&#39;helm&#39;</span><span class="p">,</span> <span class="s1">&#39;ansible&#39;</span><span class="p">,</span> <span class="s1">&#39;terraform&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;nginx&#39;</span><span class="p">,</span> <span class="s1">&#39;redis&#39;</span><span class="p">,</span> <span class="s1">&#39;postgres&#39;</span><span class="p">,</span> <span class="s1">&#39;mysql&#39;</span><span class="p">,</span> <span class="s1">&#39;mongo&#39;</span><span class="p">,</span> <span class="s1">&#39;elastic&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;prometheus&#39;</span><span class="p">,</span> <span class="s1">&#39;grafana&#39;</span><span class="p">,</span> <span class="s1">&#39;loki&#39;</span><span class="p">,</span> <span class="s1">&#39;vault&#39;</span><span class="p">,</span> <span class="s1">&#39;consul&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Languages and frameworks</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;python&#39;</span><span class="p">,</span> <span class="s1">&#39;rust&#39;</span><span class="p">,</span> <span class="s1">&#39;go&#39;</span><span class="p">,</span> <span class="s1">&#39;java&#39;</span><span class="p">,</span> <span class="s1">&#39;node&#39;</span><span class="p">,</span> <span class="s1">&#39;npm&#39;</span><span class="p">,</span> <span class="s1">&#39;yarn&#39;</span><span class="p">,</span> <span class="s1">&#39;pip&#39;</span><span class="p">,</span> <span class="s1">&#39;cargo&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;react&#39;</span><span class="p">,</span> <span class="s1">&#39;vue&#39;</span><span class="p">,</span> <span class="s1">&#39;django&#39;</span><span class="p">,</span> <span class="s1">&#39;flask&#39;</span><span class="p">,</span> <span class="s1">&#39;fastapi&#39;</span><span class="p">,</span> <span class="s1">&#39;axum&#39;</span><span class="p">,</span> <span class="s1">&#39;tokio&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Final list: three noise sources MINUS the protected tokens</span>
</span></span><span class="line"><span class="cl"><span class="n">ALL_STOPWORDS</span> <span class="o">=</span> <span class="p">(</span><span class="n">RUSSIAN_STOPWORDS</span> <span class="o">|</span> <span class="n">ENGLISH_STOPWORDS</span> <span class="o">|</span> <span class="n">CHAT_STOPWORDS</span><span class="p">)</span> <span class="o">-</span> <span class="n">PROTECTED_TOKENS</span>
</span></span></code></pre></div><p>In my pipeline: 232 stop-words and 179 protected tokens. The overlap (words that are in both lists, where protection wins): <code>get</code>, <code>set</code>, <code>from</code>, <code>in</code>, <code>on</code>, <code>as</code>, <code>if</code>, <code>do</code>, <code>for</code>, <code>with</code>, <code>not</code>, <code>is</code>, <code>or</code>, <code>and</code>, <code>no</code>, <code>all</code>, <code>by</code>.</p>
<p>A demonstration on a real query:</p>
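<p>To make the set arithmetic concrete, here is a minimal, self-contained sketch of the filtering step. The four lists below are tiny illustrative stand-ins, not the real ones, and <code>filter_tokens</code> with its regex tokenizer is a hypothetical helper (the pipeline's actual tokenizer also does stemming, which is omitted here).</p>

```python
import re
from typing import List

# Tiny illustrative lists; the real ones in the pipeline are much larger.
RUSSIAN_STOPWORDS = {"как", "для", "и", "в", "на"}
CHAT_STOPWORDS = {"привет", "подскажи", "спасибо"}
ENGLISH_STOPWORDS = {"the", "a", "for", "from", "with", "is"}
PROTECTED_TOKENS = {"for", "from", "with", "is", "docker", "compose"}

# Same set arithmetic as in the article: union of noise sources minus protection.
ALL_STOPWORDS = (RUSSIAN_STOPWORDS | ENGLISH_STOPWORDS | CHAT_STOPWORDS) - PROTECTED_TOKENS


def filter_tokens(text: str) -> List[str]:
    """Lowercase, split on non-word characters, drop stop-words (no stemming)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in ALL_STOPWORDS]


print(filter_tokens("Привет! Подскажи как настроить docker compose для production"))
print(filter_tokens("for loop from the docs"))
```

<p>In the second call <code>for</code> and <code>from</code> survive because protection wins over the English stop list, while <code>the</code> is still dropped.</p>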
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Input: &#34;Привет! Подскажи как настроить docker compose для production&#34;
</span></span><span class="line"><span class="cl">                ↓ tokenization + stemming + filtering
</span></span><span class="line"><span class="cl">Output: [&#39;настроить&#39;, &#39;docker&#39;, &#39;compose&#39;, &#39;production&#39;]
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Removed:
</span></span><span class="line"><span class="cl">  &#34;Привет&#34;    → chat stop-word (greeting)
</span></span><span class="line"><span class="cl">  &#34;Подскажи&#34;  → chat stop-word (request addressed to the assistant)
</span></span><span class="line"><span class="cl">  &#34;как&#34;       → Russian stop-word (conjunction)
</span></span><span class="line"><span class="cl">  &#34;для&#34;       → Russian stop-word (preposition)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Protected:
</span></span><span class="line"><span class="cl">  &#34;docker&#34;    → in PROTECTED_TOKENS (DevOps tool)
</span></span><span class="line"><span class="cl">  &#34;compose&#34;   → in PROTECTED_TOKENS (Docker command)
</span></span></code></pre></div><h3 id="построение-словаря-fit">Building the vocabulary (fit)</h3>
<p>BM25 has to be fitted on the corpus first, because it needs a vocabulary and an IDF value for every term:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">documents</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="s1">&#39;SimpleBM25&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Build the vocabulary and IDF from the corpus.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="n">doc_freqs</span><span class="p">:</span> <span class="n">Counter</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">total_len</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">documents</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_tokenize</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">total_len</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span><span class="n">tokens</span><span class="p">):</span>  <span class="c1"># unique per document</span>
</span></span><span class="line"><span class="cl">            <span class="n">doc_freqs</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="bp">self</span><span class="o">.</span><span class="n">avg_dl</span> <span class="o">=</span> <span class="n">total_len</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Filter: keep only terms that occur in 2+ documents</span>
</span></span><span class="line"><span class="cl">    <span class="n">min_df</span> <span class="o">=</span> <span class="mi">2</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">term</span><span class="p">,</span> <span class="n">df</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">doc_freqs</span><span class="o">.</span><span class="n">items</span><span class="p">()):</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">df</span> <span class="o">&gt;=</span> <span class="n">min_df</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">[</span><span class="n">term</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">idf</span><span class="p">[</span><span class="n">term</span><span class="p">]</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span> <span class="o">-</span> <span class="n">df</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">df</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
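</span></span><span class="line"><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="bp">self</span>  <span class="c1"># matches the declared return type and allows chaining</span>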
</span></span></code></pre></div><p><code>min_df = 2</code> drops terms that occur in only one document. These are mostly typos, hash fragments, and UUIDs: they are useless for search and bloat the vocabulary.</p>
<p>The fitted model is serialized to JSON (~5 MB for 200K+ documents) and loads in milliseconds at startup.</p>
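<p>As a rough illustration of what <code>fit</code> produces, the sketch below applies the same IDF formula and <code>min_df</code> filter to a three-document toy corpus and round-trips the result through JSON. <code>build_idf</code>, the toy corpus, and the JSON layout are assumptions made for this example, not the pipeline's actual model format.</p>

```python
import json
import math
from collections import Counter
from typing import Dict, List


def build_idf(docs: List[List[str]], min_df: int = 2) -> Dict[str, float]:
    """Compute IDF with the same formula as fit(), keeping terms with df >= min_df."""
    doc_freqs: Counter = Counter()
    for tokens in docs:
        doc_freqs.update(set(tokens))  # count each term once per document
    n = len(docs)
    return {
        term: math.log((n - df + 0.5) / (df + 0.5) + 1)
        for term, df in sorted(doc_freqs.items())
        if df >= min_df
    }


corpus = [
    ["docker", "compose", "up"],
    ["docker", "logs", "a1b2c3"],  # "a1b2c3" stands in for a hash fragment
    ["kubectl", "logs", "docker"],
]
idf = build_idf(corpus)

# Round-trip through JSON as a stand-in for saving and loading the model.
restored = json.loads(json.dumps({"doc_count": len(corpus), "idf": idf}))
print(sorted(restored["idf"]))  # terms that survived the min_df filter
print(restored["idf"]["logs"] > restored["idf"]["docker"])  # rarer term weighs more
```

<p>Only <code>docker</code> and <code>logs</code> survive the filter, and <code>logs</code> (2 of 3 documents) gets a higher IDF than <code>docker</code> (3 of 3).</p>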
<hr>
<h2 id="sparse-vectors-в-qdrant">Sparse vectors in Qdrant</h2>
<p>Qdrant stores both vector types in a single collection:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">qdrant_client</span> <span class="kn">import</span> <span class="n">QdrantClient</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">qdrant_client.models</span> <span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">VectorParams</span><span class="p">,</span> <span class="n">SparseVectorParams</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">Distance</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
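</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Assumption: a local Qdrant instance; replace the URL with your own</span>
</span></span><span class="line"><span class="cl"><span class="n">client</span> <span class="o">=</span> <span class="n">QdrantClient</span><span class="p">(</span><span class="n">url</span><span class="o">=</span><span class="s2">&#34;http://localhost:6333&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">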
</span></span><span class="line"><span class="cl"><span class="n">client</span><span class="o">.</span><span class="n">create_collection</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">collection_name</span><span class="o">=</span><span class="s2">&#34;hybrid_collection&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">vectors_config</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;dense&#34;</span><span class="p">:</span> <span class="n">VectorParams</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">size</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>        <span class="c1"># mxbai-embed-large</span>
</span></span><span class="line"><span class="cl">            <span class="n">distance</span><span class="o">=</span><span class="n">Distance</span><span class="o">.</span><span class="n">COSINE</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="n">sparse_vectors_config</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;sparse&#34;</span><span class="p">:</span> <span class="n">SparseVectorParams</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
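</span></span></code></pre></div><p>At ingestion time, every chunk gets both vectors:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">uuid</span> <span class="kn">import</span> <span class="n">uuid4</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">qdrant_client.models</span> <span class="kn">import</span> <span class="n">PointStruct</span><span class="p">,</span> <span class="n">SparseVector</span>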
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">point</span> <span class="o">=</span> <span class="n">PointStruct</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="nb">id</span><span class="o">=</span><span class="n">uuid4</span><span class="p">()</span><span class="o">.</span><span class="n">hex</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">vector</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;dense&#34;</span><span class="p">:</span> <span class="n">embedding</span><span class="p">,</span>    <span class="c1"># [0.023, -0.117, ...], 1024 floats</span>
</span></span><span class="line"><span class="cl">        <span class="s2">&#34;sparse&#34;</span><span class="p">:</span> <span class="n">SparseVector</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">indices</span><span class="o">=</span><span class="p">[</span><span class="mi">42</span><span class="p">,</span> <span class="mi">1337</span><span class="p">,</span> <span class="mi">8080</span><span class="p">],</span>   <span class="c1"># term indices in the vocabulary</span>
</span></span><span class="line"><span class="cl">            <span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="mf">2.31</span><span class="p">,</span> <span class="mf">1.87</span><span class="p">,</span> <span class="mf">0.94</span><span class="p">]</span>   <span class="c1"># BM25 scores</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">},</span>
</span></span><span class="line"><span class="cl">    <span class="n">payload</span><span class="o">=</span><span class="p">{</span><span class="s2">&#34;text&#34;</span><span class="p">:</span> <span class="n">chunk_text</span><span class="p">,</span> <span class="s2">&#34;file&#34;</span><span class="p">:</span> <span class="s2">&#34;guide.md&#34;</span><span class="p">,</span> <span class="o">...</span><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
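</span></span></code></pre></div><p>A dense vector is 1024 float32 values, which is 4 KB per point. A sparse vector averages 30-50 non-zero elements, or up to ~400 bytes assuming 8 bytes per element (a 4-byte index plus a 4-byte value). The sparse overhead is therefore at most ~10% of the dense one.</p>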
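<p>The storage estimate can be checked with a few lines of arithmetic. This sketch assumes 4 bytes per sparse index and 4 bytes per sparse value, and it counts only raw vector payload; it ignores whatever per-point and index overhead Qdrant adds on disk.</p>

```python
DENSE_DIM = 1024
FLOAT32_BYTES = 4
INDEX_BYTES = 4  # assumption: 32-bit term index
VALUE_BYTES = 4  # assumption: 32-bit float weight


def dense_bytes(dim: int = DENSE_DIM) -> int:
    """Raw size of one dense float32 vector."""
    return dim * FLOAT32_BYTES


def sparse_bytes(nonzero: int) -> int:
    """Raw size of one sparse vector with the given number of non-zero terms."""
    return nonzero * (INDEX_BYTES + VALUE_BYTES)


for nonzero in (30, 50):
    share = sparse_bytes(nonzero) / dense_bytes()
    print(f"{nonzero} non-zero terms: {sparse_bytes(nonzero)} bytes, {share:.1%} of dense")
```

<p>Under these assumptions the sparse vector takes 240 to 400 bytes, so roughly 6% to 10% of the dense one.</p>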
<hr>
<h2 id="rrf-reciprocal-rank-fusion">RRF: Reciprocal Rank Fusion</h2>
<p>We have two ranked lists: dense top-50 and sparse top-50. They need to be merged into one.</p>
<p>The naive approach (combining the raw scores directly) does not work. The dense score (cosine similarity) lies in [-1, 1], while the BM25 score is unbounded, in [0, ∞). The two scales are not comparable.</p>
<p>RRF takes a different route: forget the scores and use only the positions in each ranking.</p>
<h3 id="формула-1">Formula</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">RRF_score(d) = Σ weight_i / (K + rank_i(d))
</span></span></code></pre></div><p>For two lists:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">RRF_score(d) = 0.7 / (60 + rank_dense(d)) + 0.3 / (60 + rank_sparse(d))
</span></span></code></pre></div><ul>
<li><strong>K = 60</strong>: the smoothing constant. The higher K is, the smaller the gap between positions. With K=2 (the Qdrant default), position 1 is worth about 20 times more than position 60. With K=60 it is only about 2 times more.</li>
<li><strong>weight_dense = 0.7</strong>: semantics dominate.</li>
<li><strong>weight_sparse = 0.3</strong>: keywords pull up exact matches.</li>
</ul>
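<p>The formula above translates almost directly into code. The following is a minimal sketch of weighted RRF over lists of document ids, assuming 1-based ranks; <code>weighted_rrf</code> and the document ids are made up for illustration and are not the pipeline's actual fusion function.</p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids; ranks are 1-based."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weights[name] / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = weighted_rrf(
    rankings={
        "dense": ["doc_a", "doc_b", "doc_c"],
        "sparse": ["doc_c", "doc_d", "doc_a"],
    },
    weights={"dense": 0.7, "sparse": 0.3},
)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

<p>A document missing from one list simply gets no contribution from it, which is why <code>doc_c</code> (third in dense, first in sparse) ends up above <code>doc_b</code> (second in dense, absent from sparse).</p>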
<h3 id="почему-k60-а-не-стандартные-2">Why K=60 and not the default 2?</h3>
<p>With Qdrant's default K=2, RRF penalizes position too aggressively. A document at position 10 in the dense list scores <code>0.7 / 12 = 0.058</code>; at position 1 it scores <code>0.7 / 3 = 0.233</code>. That is a 4x gap, which means the sparse boost can barely lift a document from position 10.</p>
<p>With K=60, position 10 scores <code>0.7 / 70 = 0.0100</code> and position 1 scores <code>0.7 / 61 = 0.0115</code>. The gap is about 15%, so the sparse boost genuinely affects the ranking even for documents outside the top 5.</p>
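<p>The effect of K is easiest to see as a ratio between two rank terms, since the weight cancels out. This small sketch uses a hypothetical helper, <code>rank_ratio</code>, and assumes 1-based ranks as in the arithmetic above.</p>

```python
def rank_ratio(k: int, better_rank: int = 1, worse_rank: int = 10) -> float:
    """How many times larger the RRF term is at better_rank than at worse_rank."""
    # (w / (k + better)) / (w / (k + worse)) simplifies to this, for any weight w
    return (k + worse_rank) / (k + better_rank)


for k in (2, 60):
    print(f"K={k}: rank 1 vs rank 10 ratio = {rank_ratio(k):.2f}")
    print(f"K={k}: rank 1 vs rank 60 ratio = {rank_ratio(k, 1, 60):.2f}")
```

<p>The closer the ratio is to 1, the easier it is for the second ranking to reorder candidates that the first one placed lower.</p>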
<h3 id="реализация">Implementation</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">hybrid_search</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">limit</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span> <span class="n">mode</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s2">&#34;hybrid&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Получаем embedding от Ollama (~50ms)</span>
</span></span><span class="line"><span class="cl">    <span class="n">dense_vec</span> <span class="o">=</span> <span class="n">get_embedding</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Получаем BM25 sparse vector (&lt;5ms)</span>
</span></span><span class="line"><span class="cl">    <span class="n">sparse_vec</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">bm25</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Prefetch: 5x candidates from each method, at least 50</span>
</span></span><span class="line"><span class="cl">    <span class="n">prefetch_limit</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">limit</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Two Qdrant queries, dense then sparse (issued one after another here)</span>
</span></span><span class="line"><span class="cl">    <span class="n">dense_results</span> <span class="o">=</span> <span class="n">qdrant</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">query</span><span class="o">=</span><span class="n">dense_vec</span><span class="p">,</span> <span class="n">using</span><span class="o">=</span><span class="s2">&#34;dense&#34;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="n">prefetch_limit</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">sparse_results</span> <span class="o">=</span> <span class="n">qdrant</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">query</span><span class="o">=</span><span class="n">SparseVector</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">indices</span><span class="o">=</span><span class="n">sparse_vec</span><span class="p">[</span><span class="s2">&#34;indices&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">            <span class="n">values</span><span class="o">=</span><span class="n">sparse_vec</span><span class="p">[</span><span class="s2">&#34;values&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="p">),</span>
</span></span><span class="line"><span class="cl">        <span class="n">using</span><span class="o">=</span><span class="s2">&#34;sparse&#34;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="n">prefetch_limit</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Weighted RRF fusion</span>
</span></span><span class="line"><span class="cl">    <span class="n">DENSE_WEIGHT</span> <span class="o">=</span> <span class="mf">0.7</span>
</span></span><span class="line"><span class="cl">    <span class="n">SPARSE_WEIGHT</span> <span class="o">=</span> <span class="mf">0.3</span>
</span></span><span class="line"><span class="cl">    <span class="n">RRF_K</span> <span class="o">=</span> <span class="mi">60</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">scores</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">    <span class="n">payloads</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dense_results</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">rid</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">scores</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">DENSE_WEIGHT</span> <span class="o">/</span> <span class="p">(</span><span class="n">RRF_K</span> <span class="o">+</span> <span class="n">rank</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">payloads</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">payload</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sparse_results</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">rid</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="c1"># A document present in both lists gets the sum of both scores</span>
</span></span><span class="line"><span class="cl">        <span class="n">scores</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">rid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">SPARSE_WEIGHT</span> <span class="o">/</span> <span class="p">(</span><span class="n">RRF_K</span> <span class="o">+</span> <span class="n">rank</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">rid</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">payloads</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">payloads</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">payload</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Sort by fused score; documents scored in BOTH lists tend to rank highest</span>
</span></span><span class="line"><span class="cl">    <span class="n">ranked</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">scores</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">limit</span><span class="p">]</span>
</span></span></code></pre></div><p>The key part is <code>scores.get(rid, 0) +</code>. A document found by both methods gets the <strong>sum</strong> of both scores. A document found only by dense gets just the dense score, and one found only by sparse gets just the sparse score. As a result, the intersection tends to win.</p>
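<p>The same fusion logic can be tried without Qdrant on plain lists of IDs. This is a minimal sketch, assuming each list is already ordered best-first and contains no duplicates; <code>weighted_rrf</code> and the toy IDs are hypothetical and only mirror the loop above.</p>

```python
from typing import Dict, List, Tuple


def weighted_rrf(dense_ids: List[str], sparse_ids: List[str],
                 dense_weight: float = 0.7, sparse_weight: float = 0.3,
                 k: int = 60) -> List[Tuple[str, float]]:
    """Fuse two best-first ID lists with weighted Reciprocal Rank Fusion."""
    scores: Dict[str, float] = {}
    for rank, doc_id in enumerate(dense_ids):
        scores[doc_id] = dense_weight / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ids):
        # IDs seen in both lists accumulate both contributions.
        scores[doc_id] = scores.get(doc_id, 0.0) + sparse_weight / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ids = ["a", "b", "c"]   # "c" is only third in dense...
sparse_ids = ["d", "c", "e"]  # ...and second in sparse
fused = weighted_rrf(dense_ids, sparse_ids)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

<p>In this toy run <code>c</code> comes out first even though it leads neither list, because it is the only ID that collects a score from both.</p>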
<h3 id="fallback">Fallback</h3>
<p>If the BM25 model is not loaded, or the query consists entirely of stop words (so the sparse vector is empty), the search automatically falls back to dense-only:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">if</span> <span class="ow">not</span> <span class="n">sparse_vec</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">sparse_vec</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;indices&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Fallback: dense only</span>
</span></span><span class="line"><span class="cl">    <span class="n">results</span> <span class="o">=</span> <span class="n">qdrant</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">query</span><span class="o">=</span><span class="n">dense_vec</span><span class="p">,</span> <span class="n">using</span><span class="o">=</span><span class="s2">&#34;dense&#34;</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="n">limit</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span></code></pre></div><hr>
<h2 id="сравнение-на-живых-данных">Comparison on live data</h2>
<p>Query: <code>&quot;docker compose настройка&quot;</code>. Corpus: 235,000+ points in Qdrant (3K knowledge-base chunks from <a href="/posts/rag-03-chunking/">post 3/N</a> plus work sessions).</p>
<p><strong>Dense (semantic):</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">1. [0.865] &#34;docker-compose не найден. Попробую через docker compose (новая версия)&#34;
</span></span><span class="line"><span class="cl">2. [0.862] &#34;Теперь обновлю docker-compose и перезапущу&#34;
</span></span><span class="line"><span class="cl">3. [0.860] &#34;Контейнер в правильной сети. Попробую пересоздать его через docker compose up&#34;
</span></span></code></pre></div><p>All three are short operational phrases. Dense treats &ldquo;docker compose&rdquo; as a topic but does not distinguish a passing mention from an actual instruction.</p>
<p><strong>Sparse (BM25):</strong></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">1. [143.1] Токенизация: &#34;настроить docker compose production&#34; — тест BM25-encoder
</span></span><span class="line"><span class="cl">2. [126.3] &#34;Удалены: Привет, Подскажи. Стемминг: настройки → настройка. Защищены: docker, compose&#34;
</span></span><span class="line"><span class="cl">3. [117.7] Сравнение search методов: &#34;Query: docker compose настройка → DENSE vs SPARSE&#34;
</span></span></code></pre></div><p>BM25 found keyword matches for the query terms. The top result is the documentation of the BM25 encoder itself, which literally contains <code>&quot;настроить docker compose&quot;</code>. A direct keyword hit.</p>
<p><strong>Hybrid (RRF):</strong></p>
<p>The top results combine semantic relevance (chunks about working with docker compose) with keyword matches (specific commands and settings). Documents that appear in both lists get a boost.</p>
<hr>
<h2 id="тайминг">Timing</h2>
<table>
	<thead>
			<tr>
					<th>Stage</th>
					<th>Time</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Dense embedding (Ollama, mxbai-embed-large)</td>
					<td>~50ms</td>
			</tr>
			<tr>
					<td>BM25 encode (Python, in-memory vocab)</td>
					<td>~5ms</td>
			</tr>
			<tr>
					<td>Qdrant dense search (1024d, cosine, 235K points)</td>
					<td>~15ms</td>
			</tr>
			<tr>
					<td>Qdrant sparse search</td>
					<td>~10ms</td>
			</tr>
			<tr>
					<td>RRF fusion (Python dict merge + sort)</td>
					<td>&lt;1ms</td>
			</tr>
			<tr>
					<td><strong>Total hybrid</strong></td>
					<td><strong>~80ms</strong></td>
			</tr>
	</tbody>
</table>
<p>For comparison, a single LLM request takes 2-10 seconds. 80ms for search is negligible next to LLM inference.</p>
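<p>The total in the table is the sum of the stages run back to back. A small sketch of that arithmetic follows, with one hypothetical variant: if the two Qdrant searches were issued concurrently, only the slower one would count. The dictionary keys are made-up labels, and the fusion step is taken at its 1 ms upper bound.</p>

```python
# Stage timings from the table above, in milliseconds.
stage_ms = {
    "dense_embedding": 50,
    "bm25_encode": 5,
    "qdrant_dense_search": 15,
    "qdrant_sparse_search": 10,
    "rrf_fusion": 1,  # upper bound: the table lists it as under 1 ms
}

# Every stage runs after the previous one finishes.
sequential_ms = sum(stage_ms.values())

# Hypothetical: the two Qdrant searches overlap, so only the slower one counts.
search_stages = ("qdrant_dense_search", "qdrant_sparse_search")
overlapped_ms = (
    sequential_ms
    - sum(stage_ms[name] for name in search_stages)
    + max(stage_ms[name] for name in search_stages)
)

print(f"sequential: ~{sequential_ms} ms, overlapped searches: ~{overlapped_ms} ms")
```

<p>Either way the embedding call dominates, so any further latency work would start there rather than in the fusion step.</p>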
<hr>
<h2 id="когда-какой-режим">When to use which mode</h2>
<table>
	<thead>
			<tr>
					<th>Query</th>
					<th>Best mode</th>
					<th>Why</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td><code>&quot;docker compose настройка&quot;</code></td>
					<td>hybrid</td>
					<td>Needs both keywords and semantics</td>
			</tr>
			<tr>
					<td><code>&quot;как работает Service в Kubernetes&quot;</code></td>
					<td>dense</td>
					<td>Semantic question, no specific commands</td>
			</tr>
			<tr>
					<td><code>&quot;kubectl get pods -n kube-system&quot;</code></td>
					<td>sparse</td>
					<td>Exact command, semantics not needed</td>
			</tr>
			<tr>
					<td><code>&quot;проблемы с сетью контейнеров&quot;</code></td>
					<td>dense</td>
					<td>Abstract query, BM25 will not help</td>
			</tr>
			<tr>
					<td><code>&quot;nginx proxy_pass upstream&quot;</code></td>
					<td>hybrid</td>
					<td>Technical terms + context</td>
			</tr>
	</tbody>
</table>
<p>In my pipeline the default is always hybrid. The fallback to dense happens only when the BM25 model is not loaded or the sparse vector comes back empty.</p>
<hr>
<h2 id="мини-тест">Mini quiz</h2>
<p><strong>1. You search for <code>&quot;docker run nginx&quot;</code>. Why will naive BM25 (without protected tokens) find everything except what you need?</strong></p>
<details>
<summary>Answer</summary>
<p>Naive BM25 treats <code>run</code> as an English stop word (the everyday verb &ldquo;to run&rdquo;) and drops it. Only <code>docker</code> and <code>nginx</code> remain. The search returns any chunk that mentions Docker and Nginx together (a Dockerfile, docker-compose, an nginx config) but not <code>docker run</code> specifically. With protected tokens, <code>run</code> is preserved and BM25 finds exactly the container launch.</p>
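<p>The effect is easy to reproduce on a toy tokenizer. This is a minimal sketch, assuming a hypothetical naive stop list that happens to include command words; <code>NAIVE_STOPWORDS</code>, this small <code>PROTECTED</code> set, and <code>tokenize</code> are illustrative names, not the encoder from this post.</p>

```python
import re

# Hypothetical naive stop list that happens to contain command words.
NAIVE_STOPWORDS = {"the", "a", "run", "find", "top"}
# Tokens that must survive filtering because they are commands.
PROTECTED = {"run", "find", "top", "kill"}


def tokenize(text: str, protect: bool) -> list:
    """Lowercase, split into tokens, drop stop words unless they are protected."""
    tokens = re.findall(r"[a-zа-яё0-9_-]+", text.lower())
    kept = []
    for token in tokens:
        if protect and token in PROTECTED:
            kept.append(token)  # protected tokens bypass the stop list
        elif token not in NAIVE_STOPWORDS:
            kept.append(token)
    return kept


print(tokenize("docker run nginx", protect=False))
print(tokenize("docker run nginx", protect=True))
```

<p>Without protection the query collapses to <code>docker</code> and <code>nginx</code>; with it, all three tokens reach BM25.</p>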
</details>
<p><strong>2. Query: <code>&quot;как поднять сервисы в фоне&quot;</code>. Which search mode works best, and why?</strong></p>
<details>
<summary>Answer</summary>
<p>Dense. The query contains no specific command, only natural language. BM25 will look for the words &ldquo;поднять&rdquo;, &ldquo;сервисы&rdquo;, &ldquo;фоне&rdquo; literally and will most likely find nothing relevant. Dense captures the meaning and links the query to <code>docker compose up -d</code>, <code>systemctl start</code>, <code>nohup</code>, because the embedding model knows that &ldquo;поднять в фоне&rdquo; (bring up in the background) is semantically close to &ldquo;запустить как демон&rdquo; (run as a daemon).</p>
<p>Hybrid also works (the dense part does the lifting), but the sparse part adds nothing useful.</p>
</details>
<p><strong>3. You added 10,000 new documents about Cilium to the corpus but forgot to refit BM25. What breaks?</strong></p>
<details>
<summary>Answer</summary>
<p>Dense search finds the new documents without trouble: the embedding model knows what Cilium is. BM25 does not: the words &ldquo;cilium&rdquo;, &ldquo;ebpf&rdquo;, &ldquo;hubble&rdquo; are missing from the vocabulary, so the sparse vector for these terms is empty. Hybrid search degrades to pure dense for any Cilium query. Worse, the IDF of older terms is stale too: term frequencies have shifted, but BM25 does not know it.</p>
<p>Takeaway: after a substantial corpus update, BM25 must be refit (<code>fit</code> on the new data).</p>
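<p>A toy model makes both failure modes visible. This is a minimal sketch, assuming whitespace tokenization and the common <code>ln(1 + (N - n + 0.5) / (n + 0.5))</code> IDF variant; <code>TinyBM25Vocab</code> is a hypothetical stand-in, not the encoder from this post.</p>

```python
import math
from collections import Counter


class TinyBM25Vocab:
    """Toy vocabulary + IDF table; a stand-in, not the article's encoder."""

    def fit(self, docs):
        self.n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc.lower().split()))
        self.vocab = {term: idx for idx, term in enumerate(sorted(doc_freq))}
        self.idf = {
            term: math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))
            for term, n in doc_freq.items()
        }
        return self

    def encode_query(self, query):
        indices, values = [], []
        for term in dict.fromkeys(query.lower().split()):
            if term in self.vocab:  # unseen terms are silently dropped
                indices.append(self.vocab[term])
                values.append(self.idf[term])
        return {"indices": indices, "values": values}


old_corpus = ["docker compose up", "kubectl get pods", "nginx proxy_pass upstream"]
new_docs = ["cilium hubble ebpf observability"]

stale = TinyBM25Vocab().fit(old_corpus)
refit = TinyBM25Vocab().fit(old_corpus + new_docs)

print(stale.encode_query("cilium hubble ebpf"))
print(refit.encode_query("cilium hubble ebpf"))
print(stale.idf["docker"], refit.idf["docker"])
```

<p>The stale encoder returns an empty sparse vector for the Cilium query, which is exactly the condition that triggers the dense-only fallback, and the IDF of an old term such as <code>docker</code> changes once the corpus grows.</p>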
</details>
<hr>
<h2 id="артефакт-bm25-encoder--hybrid-search">Artifact: BM25 encoder + hybrid search</h2>
<p>The complete working script. Requires <code>qdrant-client</code>, <code>requests</code>, <code>pymorphy3</code> (optional), and Ollama with <code>mxbai-embed-large</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="ch">#!/usr/bin/env python3</span>
</span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">hybrid-search-demo.py -- hybrid search demo: Dense + BM25 + RRF.
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Requirements:
</span></span></span><span class="line"><span class="cl"><span class="s2">  pip install qdrant-client requests pymorphy3
</span></span></span><span class="line"><span class="cl"><span class="s2">  # Ollama with mxbai-embed-large running on localhost:11434
</span></span></span><span class="line"><span class="cl"><span class="s2">  # Qdrant running on localhost:6333
</span></span></span><span class="line"><span class="cl"><span class="s2">
</span></span></span><span class="line"><span class="cl"><span class="s2">Usage:
</span></span></span><span class="line"><span class="cl"><span class="s2">  python3 hybrid-search-demo.py &#34;docker compose настройка&#34;
</span></span></span><span class="line"><span class="cl"><span class="s2">  python3 hybrid-search-demo.py &#34;kubectl apply&#34; --mode sparse
</span></span></span><span class="line"><span class="cl"><span class="s2">  python3 hybrid-search-demo.py &#34;как работает Service&#34; --mode dense
</span></span></span><span class="line"><span class="cl"><span class="s2">&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">argparse</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">math</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">re</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">time</span>
</span></span><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">requests</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">qdrant_client</span> <span class="kn">import</span> <span class="n">QdrantClient</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">qdrant_client.models</span> <span class="kn">import</span> <span class="n">SparseVector</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># ============================================</span>
</span></span><span class="line"><span class="cl"><span class="c1"># BM25 ENCODER</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ============================================</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="kn">import</span> <span class="nn">pymorphy3</span>
</span></span><span class="line"><span class="cl">    <span class="n">MORPH</span> <span class="o">=</span> <span class="n">pymorphy3</span><span class="o">.</span><span class="n">MorphAnalyzer</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="kn">import</span> <span class="nn">pymorphy2</span>
</span></span><span class="line"><span class="cl">        <span class="n">MORPH</span> <span class="o">=</span> <span class="n">pymorphy2</span><span class="o">.</span><span class="n">MorphAnalyzer</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">MORPH</span> <span class="o">=</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">RUSSIAN_STOPWORDS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;на&#39;</span><span class="p">,</span> <span class="s1">&#39;в&#39;</span><span class="p">,</span> <span class="s1">&#39;во&#39;</span><span class="p">,</span> <span class="s1">&#39;с&#39;</span><span class="p">,</span> <span class="s1">&#39;со&#39;</span><span class="p">,</span> <span class="s1">&#39;к&#39;</span><span class="p">,</span> <span class="s1">&#39;ко&#39;</span><span class="p">,</span> <span class="s1">&#39;о&#39;</span><span class="p">,</span> <span class="s1">&#39;об&#39;</span><span class="p">,</span> <span class="s1">&#39;по&#39;</span><span class="p">,</span> <span class="s1">&#39;за&#39;</span><span class="p">,</span> <span class="s1">&#39;из&#39;</span><span class="p">,</span> <span class="s1">&#39;от&#39;</span><span class="p">,</span> <span class="s1">&#39;до&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;у&#39;</span><span class="p">,</span> <span class="s1">&#39;при&#39;</span><span class="p">,</span> <span class="s1">&#39;для&#39;</span><span class="p">,</span> <span class="s1">&#39;без&#39;</span><span class="p">,</span> <span class="s1">&#39;под&#39;</span><span class="p">,</span> <span class="s1">&#39;над&#39;</span><span class="p">,</span> <span class="s1">&#39;через&#39;</span><span class="p">,</span> <span class="s1">&#39;между&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;и&#39;</span><span class="p">,</span> <span class="s1">&#39;а&#39;</span><span class="p">,</span> <span class="s1">&#39;но&#39;</span><span class="p">,</span> <span class="s1">&#39;или&#39;</span><span class="p">,</span> <span class="s1">&#39;да&#39;</span><span class="p">,</span> <span class="s1">&#39;же&#39;</span><span class="p">,</span> <span class="s1">&#39;ли&#39;</span><span class="p">,</span> <span class="s1">&#39;ни&#39;</span><span class="p">,</span> <span class="s1">&#39;что&#39;</span><span class="p">,</span> <span class="s1">&#39;как&#39;</span><span class="p">,</span> <span class="s1">&#39;если&#39;</span><span class="p">,</span> <span class="s1">&#39;когда&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;чтобы&#39;</span><span class="p">,</span> <span class="s1">&#39;потому&#39;</span><span class="p">,</span> <span class="s1">&#39;поэтому&#39;</span><span class="p">,</span> <span class="s1">&#39;так&#39;</span><span class="p">,</span> <span class="s1">&#39;тоже&#39;</span><span class="p">,</span> <span class="s1">&#39;также&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;я&#39;</span><span class="p">,</span> <span class="s1">&#39;ты&#39;</span><span class="p">,</span> <span class="s1">&#39;он&#39;</span><span class="p">,</span> <span class="s1">&#39;она&#39;</span><span class="p">,</span> <span class="s1">&#39;оно&#39;</span><span class="p">,</span> <span class="s1">&#39;мы&#39;</span><span class="p">,</span> <span class="s1">&#39;вы&#39;</span><span class="p">,</span> <span class="s1">&#39;они&#39;</span><span class="p">,</span> <span class="s1">&#39;это&#39;</span><span class="p">,</span> <span class="s1">&#39;то&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;мой&#39;</span><span class="p">,</span> <span class="s1">&#39;твой&#39;</span><span class="p">,</span> <span class="s1">&#39;его&#39;</span><span class="p">,</span> <span class="s1">&#39;её&#39;</span><span class="p">,</span> <span class="s1">&#39;их&#39;</span><span class="p">,</span> <span class="s1">&#39;наш&#39;</span><span class="p">,</span> <span class="s1">&#39;ваш&#39;</span><span class="p">,</span> <span class="s1">&#39;свой&#39;</span><span class="p">,</span> <span class="s1">&#39;кто&#39;</span><span class="p">,</span> <span class="s1">&#39;сам&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;не&#39;</span><span class="p">,</span> <span class="s1">&#39;бы&#39;</span><span class="p">,</span> <span class="s1">&#39;вот&#39;</span><span class="p">,</span> <span class="s1">&#39;уже&#39;</span><span class="p">,</span> <span class="s1">&#39;ещё&#39;</span><span class="p">,</span> <span class="s1">&#39;только&#39;</span><span class="p">,</span> <span class="s1">&#39;очень&#39;</span><span class="p">,</span> <span class="s1">&#39;там&#39;</span><span class="p">,</span> <span class="s1">&#39;тут&#39;</span><span class="p">,</span> <span class="s1">&#39;где&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;быть&#39;</span><span class="p">,</span> <span class="s1">&#39;есть&#39;</span><span class="p">,</span> <span class="s1">&#39;был&#39;</span><span class="p">,</span> <span class="s1">&#39;была&#39;</span><span class="p">,</span> <span class="s1">&#39;было&#39;</span><span class="p">,</span> <span class="s1">&#39;были&#39;</span><span class="p">,</span> <span class="s1">&#39;будет&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;можно&#39;</span><span class="p">,</span> <span class="s1">&#39;нужно&#39;</span><span class="p">,</span> <span class="s1">&#39;надо&#39;</span><span class="p">,</span> <span class="s1">&#39;нельзя&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;кстати&#39;</span><span class="p">,</span> <span class="s1">&#39;например&#39;</span><span class="p">,</span> <span class="s1">&#39;конечно&#39;</span><span class="p">,</span> <span class="s1">&#39;наверное&#39;</span><span class="p">,</span> <span class="s1">&#39;возможно&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">CHAT_STOPWORDS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;привет&#39;</span><span class="p">,</span> <span class="s1">&#39;здравствуйте&#39;</span><span class="p">,</span> <span class="s1">&#39;пожалуйста&#39;</span><span class="p">,</span> <span class="s1">&#39;спасибо&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;подскажи&#39;</span><span class="p">,</span> <span class="s1">&#39;помоги&#39;</span><span class="p">,</span> <span class="s1">&#39;объясни&#39;</span><span class="p">,</span> <span class="s1">&#39;расскажи&#39;</span><span class="p">,</span> <span class="s1">&#39;покажи&#39;</span><span class="p">,</span> <span class="s1">&#39;сделай&#39;</span><span class="p">,</span> <span class="s1">&#39;напиши&#39;</span><span class="p">,</span> <span class="s1">&#39;создай&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;отлично&#39;</span><span class="p">,</span> <span class="s1">&#39;хорошо&#39;</span><span class="p">,</span> <span class="s1">&#39;понял&#39;</span><span class="p">,</span> <span class="s1">&#39;понятно&#39;</span><span class="p">,</span> <span class="s1">&#39;готово&#39;</span><span class="p">,</span> <span class="s1">&#39;сделано&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;давай&#39;</span><span class="p">,</span> <span class="s1">&#39;ладно&#39;</span><span class="p">,</span> <span class="s1">&#39;окей&#39;</span><span class="p">,</span> <span class="s1">&#39;типа&#39;</span><span class="p">,</span> <span class="s1">&#39;короче&#39;</span><span class="p">,</span> <span class="s1">&#39;вообще&#39;</span><span class="p">,</span> <span class="s1">&#39;просто&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">ENGLISH_STOPWORDS</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;the&#39;</span><span class="p">,</span> <span class="s1">&#39;a&#39;</span><span class="p">,</span> <span class="s1">&#39;an&#39;</span><span class="p">,</span> <span class="s1">&#39;is&#39;</span><span class="p">,</span> <span class="s1">&#39;are&#39;</span><span class="p">,</span> <span class="s1">&#39;was&#39;</span><span class="p">,</span> <span class="s1">&#39;were&#39;</span><span class="p">,</span> <span class="s1">&#39;be&#39;</span><span class="p">,</span> <span class="s1">&#39;been&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;have&#39;</span><span class="p">,</span> <span class="s1">&#39;has&#39;</span><span class="p">,</span> <span class="s1">&#39;had&#39;</span><span class="p">,</span> <span class="s1">&#39;do&#39;</span><span class="p">,</span> <span class="s1">&#39;does&#39;</span><span class="p">,</span> <span class="s1">&#39;did&#39;</span><span class="p">,</span> <span class="s1">&#39;will&#39;</span><span class="p">,</span> <span class="s1">&#39;would&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;this&#39;</span><span class="p">,</span> <span class="s1">&#39;that&#39;</span><span class="p">,</span> <span class="s1">&#39;it&#39;</span><span class="p">,</span> <span class="s1">&#39;its&#39;</span><span class="p">,</span> <span class="s1">&#39;and&#39;</span><span class="p">,</span> <span class="s1">&#39;or&#39;</span><span class="p">,</span> <span class="s1">&#39;but&#39;</span><span class="p">,</span> <span class="s1">&#39;if&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;of&#39;</span><span class="p">,</span> <span class="s1">&#39;at&#39;</span><span class="p">,</span> <span class="s1">&#39;by&#39;</span><span class="p">,</span> <span class="s1">&#39;for&#39;</span><span class="p">,</span> <span class="s1">&#39;with&#39;</span><span class="p">,</span> <span class="s1">&#39;about&#39;</span><span class="p">,</span> <span class="s1">&#39;into&#39;</span><span class="p">,</span> <span class="s1">&#39;from&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;no&#39;</span><span class="p">,</span> <span class="s1">&#39;not&#39;</span><span class="p">,</span> <span class="s1">&#39;only&#39;</span><span class="p">,</span> <span class="s1">&#39;so&#39;</span><span class="p">,</span> <span class="s1">&#39;than&#39;</span><span class="p">,</span> <span class="s1">&#39;too&#39;</span><span class="p">,</span> <span class="s1">&#39;very&#39;</span><span class="p">,</span> <span class="s1">&#39;just&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">PROTECTED</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Bash</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;cd&#39;</span><span class="p">,</span> <span class="s1">&#39;ls&#39;</span><span class="p">,</span> <span class="s1">&#39;rm&#39;</span><span class="p">,</span> <span class="s1">&#39;cp&#39;</span><span class="p">,</span> <span class="s1">&#39;mv&#39;</span><span class="p">,</span> <span class="s1">&#39;cat&#39;</span><span class="p">,</span> <span class="s1">&#39;grep&#39;</span><span class="p">,</span> <span class="s1">&#39;sed&#39;</span><span class="p">,</span> <span class="s1">&#39;awk&#39;</span><span class="p">,</span> <span class="s1">&#39;find&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;head&#39;</span><span class="p">,</span> <span class="s1">&#39;tail&#39;</span><span class="p">,</span> <span class="s1">&#39;echo&#39;</span><span class="p">,</span> <span class="s1">&#39;touch&#39;</span><span class="p">,</span> <span class="s1">&#39;mkdir&#39;</span><span class="p">,</span> <span class="s1">&#39;chmod&#39;</span><span class="p">,</span> <span class="s1">&#39;chown&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;curl&#39;</span><span class="p">,</span> <span class="s1">&#39;wget&#39;</span><span class="p">,</span> <span class="s1">&#39;ssh&#39;</span><span class="p">,</span> <span class="s1">&#39;scp&#39;</span><span class="p">,</span> <span class="s1">&#39;rsync&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;ps&#39;</span><span class="p">,</span> <span class="s1">&#39;kill&#39;</span><span class="p">,</span> <span class="s1">&#39;top&#39;</span><span class="p">,</span> <span class="s1">&#39;df&#39;</span><span class="p">,</span> <span class="s1">&#39;du&#39;</span><span class="p">,</span> <span class="s1">&#39;free&#39;</span><span class="p">,</span> <span class="s1">&#39;tar&#39;</span><span class="p">,</span> <span class="s1">&#39;zip&#39;</span><span class="p">,</span> <span class="s1">&#39;unzip&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Docker</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;up&#39;</span><span class="p">,</span> <span class="s1">&#39;down&#39;</span><span class="p">,</span> <span class="s1">&#39;run&#39;</span><span class="p">,</span> <span class="s1">&#39;exec&#39;</span><span class="p">,</span> <span class="s1">&#39;build&#39;</span><span class="p">,</span> <span class="s1">&#39;push&#39;</span><span class="p">,</span> <span class="s1">&#39;pull&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;stop&#39;</span><span class="p">,</span> <span class="s1">&#39;start&#39;</span><span class="p">,</span> <span class="s1">&#39;restart&#39;</span><span class="p">,</span> <span class="s1">&#39;logs&#39;</span><span class="p">,</span> <span class="s1">&#39;images&#39;</span><span class="p">,</span> <span class="s1">&#39;prune&#39;</span><span class="p">,</span> <span class="s1">&#39;compose&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Git</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;add&#39;</span><span class="p">,</span> <span class="s1">&#39;commit&#39;</span><span class="p">,</span> <span class="s1">&#39;push&#39;</span><span class="p">,</span> <span class="s1">&#39;pull&#39;</span><span class="p">,</span> <span class="s1">&#39;fetch&#39;</span><span class="p">,</span> <span class="s1">&#39;merge&#39;</span><span class="p">,</span> <span class="s1">&#39;rebase&#39;</span><span class="p">,</span> <span class="s1">&#39;checkout&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;clone&#39;</span><span class="p">,</span> <span class="s1">&#39;init&#39;</span><span class="p">,</span> <span class="s1">&#39;status&#39;</span><span class="p">,</span> <span class="s1">&#39;diff&#39;</span><span class="p">,</span> <span class="s1">&#39;log&#39;</span><span class="p">,</span> <span class="s1">&#39;reset&#39;</span><span class="p">,</span> <span class="s1">&#39;stash&#39;</span><span class="p">,</span> <span class="s1">&#39;tag&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Kubernetes</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;get&#39;</span><span class="p">,</span> <span class="s1">&#39;set&#39;</span><span class="p">,</span> <span class="s1">&#39;apply&#39;</span><span class="p">,</span> <span class="s1">&#39;delete&#39;</span><span class="p">,</span> <span class="s1">&#39;describe&#39;</span><span class="p">,</span> <span class="s1">&#39;logs&#39;</span><span class="p">,</span> <span class="s1">&#39;exec&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;scale&#39;</span><span class="p">,</span> <span class="s1">&#39;rollout&#39;</span><span class="p">,</span> <span class="s1">&#39;expose&#39;</span><span class="p">,</span> <span class="s1">&#39;create&#39;</span><span class="p">,</span> <span class="s1">&#39;edit&#39;</span><span class="p">,</span> <span class="s1">&#39;patch&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Programming</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;if&#39;</span><span class="p">,</span> <span class="s1">&#39;else&#39;</span><span class="p">,</span> <span class="s1">&#39;for&#39;</span><span class="p">,</span> <span class="s1">&#39;while&#39;</span><span class="p">,</span> <span class="s1">&#39;do&#39;</span><span class="p">,</span> <span class="s1">&#39;case&#39;</span><span class="p">,</span> <span class="s1">&#39;try&#39;</span><span class="p">,</span> <span class="s1">&#39;catch&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;return&#39;</span><span class="p">,</span> <span class="s1">&#39;break&#39;</span><span class="p">,</span> <span class="s1">&#39;continue&#39;</span><span class="p">,</span> <span class="s1">&#39;pass&#39;</span><span class="p">,</span> <span class="s1">&#39;raise&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;import&#39;</span><span class="p">,</span> <span class="s1">&#39;from&#39;</span><span class="p">,</span> <span class="s1">&#39;as&#39;</span><span class="p">,</span> <span class="s1">&#39;with&#39;</span><span class="p">,</span> <span class="s1">&#39;in&#39;</span><span class="p">,</span> <span class="s1">&#39;on&#39;</span><span class="p">,</span> <span class="s1">&#39;not&#39;</span><span class="p">,</span> <span class="s1">&#39;is&#39;</span><span class="p">,</span> <span class="s1">&#39;or&#39;</span><span class="p">,</span> <span class="s1">&#39;and&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;def&#39;</span><span class="p">,</span> <span class="s1">&#39;class&#39;</span><span class="p">,</span> <span class="s1">&#39;fn&#39;</span><span class="p">,</span> <span class="s1">&#39;func&#39;</span><span class="p">,</span> <span class="s1">&#39;function&#39;</span><span class="p">,</span> <span class="s1">&#39;let&#39;</span><span class="p">,</span> <span class="s1">&#39;const&#39;</span><span class="p">,</span> <span class="s1">&#39;var&#39;</span><span class="p">,</span> <span class="s1">&#39;mut&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;true&#39;</span><span class="p">,</span> <span class="s1">&#39;false&#39;</span><span class="p">,</span> <span class="s1">&#39;null&#39;</span><span class="p">,</span> <span class="s1">&#39;none&#39;</span><span class="p">,</span> <span class="s1">&#39;nil&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># HTTP / API</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;get&#39;</span><span class="p">,</span> <span class="s1">&#39;post&#39;</span><span class="p">,</span> <span class="s1">&#39;put&#39;</span><span class="p">,</span> <span class="s1">&#39;patch&#39;</span><span class="p">,</span> <span class="s1">&#39;delete&#39;</span><span class="p">,</span> <span class="s1">&#39;head&#39;</span><span class="p">,</span> <span class="s1">&#39;options&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;api&#39;</span><span class="p">,</span> <span class="s1">&#39;http&#39;</span><span class="p">,</span> <span class="s1">&#39;https&#39;</span><span class="p">,</span> <span class="s1">&#39;tcp&#39;</span><span class="p">,</span> <span class="s1">&#39;udp&#39;</span><span class="p">,</span> <span class="s1">&#39;ssh&#39;</span><span class="p">,</span> <span class="s1">&#39;ftp&#39;</span><span class="p">,</span> <span class="s1">&#39;dns&#39;</span><span class="p">,</span> <span class="s1">&#39;ssl&#39;</span><span class="p">,</span> <span class="s1">&#39;tls&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># SQL</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;select&#39;</span><span class="p">,</span> <span class="s1">&#39;insert&#39;</span><span class="p">,</span> <span class="s1">&#39;update&#39;</span><span class="p">,</span> <span class="s1">&#39;delete&#39;</span><span class="p">,</span> <span class="s1">&#39;from&#39;</span><span class="p">,</span> <span class="s1">&#39;where&#39;</span><span class="p">,</span> <span class="s1">&#39;join&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;order&#39;</span><span class="p">,</span> <span class="s1">&#39;by&#39;</span><span class="p">,</span> <span class="s1">&#39;group&#39;</span><span class="p">,</span> <span class="s1">&#39;having&#39;</span><span class="p">,</span> <span class="s1">&#39;limit&#39;</span><span class="p">,</span> <span class="s1">&#39;offset&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># DevOps tools</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;docker&#39;</span><span class="p">,</span> <span class="s1">&#39;kubernetes&#39;</span><span class="p">,</span> <span class="s1">&#39;k8s&#39;</span><span class="p">,</span> <span class="s1">&#39;helm&#39;</span><span class="p">,</span> <span class="s1">&#39;ansible&#39;</span><span class="p">,</span> <span class="s1">&#39;terraform&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;nginx&#39;</span><span class="p">,</span> <span class="s1">&#39;redis&#39;</span><span class="p">,</span> <span class="s1">&#39;postgres&#39;</span><span class="p">,</span> <span class="s1">&#39;mysql&#39;</span><span class="p">,</span> <span class="s1">&#39;mongo&#39;</span><span class="p">,</span> <span class="s1">&#39;elastic&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;prometheus&#39;</span><span class="p">,</span> <span class="s1">&#39;grafana&#39;</span><span class="p">,</span> <span class="s1">&#39;loki&#39;</span><span class="p">,</span> <span class="s1">&#39;vault&#39;</span><span class="p">,</span> <span class="s1">&#39;consul&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Languages and frameworks</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;python&#39;</span><span class="p">,</span> <span class="s1">&#39;rust&#39;</span><span class="p">,</span> <span class="s1">&#39;go&#39;</span><span class="p">,</span> <span class="s1">&#39;java&#39;</span><span class="p">,</span> <span class="s1">&#39;node&#39;</span><span class="p">,</span> <span class="s1">&#39;npm&#39;</span><span class="p">,</span> <span class="s1">&#39;yarn&#39;</span><span class="p">,</span> <span class="s1">&#39;pip&#39;</span><span class="p">,</span> <span class="s1">&#39;cargo&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="s1">&#39;react&#39;</span><span class="p">,</span> <span class="s1">&#39;vue&#39;</span><span class="p">,</span> <span class="s1">&#39;django&#39;</span><span class="p">,</span> <span class="s1">&#39;flask&#39;</span><span class="p">,</span> <span class="s1">&#39;fastapi&#39;</span><span class="p">,</span> <span class="s1">&#39;axum&#39;</span><span class="p">,</span> <span class="s1">&#39;tokio&#39;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
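</span></span><span class="line"><span class="cl"><span class="c1"># Subtracting PROTECTED keeps commands and keywords such as &#39;if&#39;, &#39;for&#39;, &#39;from&#39; out of the stopword list</span>
</span></span><span class="line"><span class="cl"><span class="n">FINAL_STOPWORDS</span> <span class="o">=</span> <span class="p">(</span><span class="n">RUSSIAN_STOPWORDS</span> <span class="o">|</span> <span class="n">ENGLISH_STOPWORDS</span> <span class="o">|</span> <span class="n">CHAT_STOPWORDS</span><span class="p">)</span> <span class="o">-</span> <span class="n">PROTECTED</span>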
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
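</span></span><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">SimpleBM25</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Minimal BM25 encoder: fit() builds the vocabulary and IDF table, encode() returns a sparse vector.&#34;&#34;&#34;</span>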
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">k1</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.5</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.75</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">k1</span> <span class="o">=</span> <span class="n">k1</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">b</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">idf</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">avg_dl</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_stem</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">word</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">MORPH</span> <span class="ow">and</span> <span class="n">re</span><span class="o">.</span><span class="k">match</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;^[а-яё]+$&#39;</span><span class="p">,</span> <span class="n">word</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="n">parsed</span> <span class="o">=</span> <span class="n">MORPH</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">word</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">parsed</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">return</span> <span class="n">parsed</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">normal_form</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">word</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_tokenize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="n">text</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">raw</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;\b[a-zA-Zа-яА-ЯёЁ0-9_]{2,}\b&#39;</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_stem</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">raw</span> <span class="k">if</span> <span class="n">t</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">FINAL_STOPWORDS</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">documents</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="s1">&#39;SimpleBM25&#39;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">doc_freqs</span><span class="p">:</span> <span class="n">Counter</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="n">total_len</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl">        <span class="n">valid_docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">documents</span> <span class="k">if</span> <span class="n">d</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">valid_docs</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">tokens</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_tokenize</span><span class="p">(</span><span class="n">doc</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">total_len</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="nb">set</span><span class="p">(</span><span class="n">tokens</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">                <span class="n">doc_freqs</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">valid_docs</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">avg_dl</span> <span class="o">=</span> <span class="n">total_len</span> <span class="o">/</span> <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span> <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span> <span class="k">else</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">term</span><span class="p">,</span> <span class="n">df</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">doc_freqs</span><span class="o">.</span><span class="n">items</span><span class="p">()):</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">df</span> <span class="o">&gt;=</span> <span class="mi">2</span><span class="p">:</span>  <span class="c1"># min_df: drop terms seen in only one document (noise filter)</span>
</span></span><span class="line"><span class="cl">                <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">[</span><span class="n">term</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="c1"># IDF with +1 inside the log, so the weight stays positive even for very common terms</span>
</span></span><span class="line"><span class="cl">                <span class="bp">self</span><span class="o">.</span><span class="n">idf</span><span class="p">[</span><span class="n">term</span><span class="p">]</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">log</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                    <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">doc_count</span> <span class="o">-</span> <span class="n">df</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">df</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
</span></span><span class="line"><span class="cl">                <span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">List</span><span class="p">]:</span>
</span></span><span class="line"><span class="cl">        <span class="n">tokens</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">tf</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">doc_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">indices</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">for</span> <span class="n">term</span><span class="p">,</span> <span class="n">freq</span> <span class="ow">in</span> <span class="n">tf</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">term</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="k">continue</span>
</span></span><span class="line"><span class="cl">            <span class="n">idx</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">vocab</span><span class="p">[</span><span class="n">term</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">idf</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">idf</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">term</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">num</span> <span class="o">=</span> <span class="n">freq</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">k1</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">den</span> <span class="o">=</span> <span class="n">freq</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">k1</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">*</span> <span class="n">doc_len</span> <span class="o">/</span> <span class="nb">max</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">avg_dl</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">score</span> <span class="o">=</span> <span class="n">idf</span> <span class="o">*</span> <span class="p">(</span><span class="n">num</span> <span class="o">/</span> <span class="n">den</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="n">score</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">                <span class="n">indices</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">
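</span></span><span class="line"><span class="cl">        <span class="c1"># Sort by vocabulary index so the sparse vector has ascending indices</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">indices</span><span class="p">:</span>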
</span></span><span class="line"><span class="cl">            <span class="n">pairs</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">indices</span><span class="p">,</span> <span class="n">values</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">indices</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="p">[</span><span class="nb">list</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">pairs</span><span class="p">)]</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">{</span><span class="s2">&#34;indices&#34;</span><span class="p">:</span> <span class="n">indices</span><span class="p">,</span> <span class="s2">&#34;values&#34;</span><span class="p">:</span> <span class="n">values</span><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># ============================================</span>
</span></span><span class="line"><span class="cl"><span class="c1"># HYBRID SEARCH</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ============================================</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">OLLAMA_URL</span> <span class="o">=</span> <span class="s2">&#34;http://localhost:11434/api/embed&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">EMBED_MODEL</span> <span class="o">=</span> <span class="s2">&#34;mxbai-embed-large&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">get_embedding</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">]]:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Dense embedding via Ollama.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="k">try</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">OLLAMA_URL</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="s2">&#34;model&#34;</span><span class="p">:</span> <span class="n">EMBED_MODEL</span><span class="p">,</span> <span class="s2">&#34;input&#34;</span><span class="p">:</span> <span class="n">text</span><span class="p">[:</span><span class="mi">800</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="p">},</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="n">resp</span><span class="o">.</span><span class="n">json</span><span class="p">()</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;embeddings&#34;</span><span class="p">,</span> <span class="p">[</span><span class="kc">None</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">pass</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">hybrid_search</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">client</span><span class="p">:</span> <span class="n">QdrantClient</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">bm25</span><span class="p">:</span> <span class="n">SimpleBM25</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">collection</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">limit</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">mode</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s2">&#34;hybrid&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">):</span>
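</span></span><span class="line"><span class="cl">    <span class="s2">&#34;&#34;&#34;Hybrid search: Dense + BM25 + RRF.&#34;&#34;&#34;</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Skip the embedding call in sparse mode; it is the slowest step.</span>
</span></span><span class="line"><span class="cl">    <span class="n">dense_vec</span> <span class="o">=</span> <span class="n">get_embedding</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="k">if</span> <span class="n">mode</span> <span class="o">!=</span> <span class="s2">&#34;sparse&#34;</span> <span class="k">else</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    <span class="n">sparse_vec</span> <span class="o">=</span> <span class="n">bm25</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="k">if</span> <span class="n">mode</span> <span class="o">!=</span> <span class="s2">&#34;dense&#34;</span> <span class="k">else</span> <span class="kc">None</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">mode</span> <span class="o">!=</span> <span class="s2">&#34;sparse&#34;</span> <span class="ow">and</span> <span class="n">dense_vec</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">raise</span> <span class="ne">RuntimeError</span><span class="p">(</span><span class="s2">&#34;Dense embedding request failed&#34;</span><span class="p">)</span>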
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">prefetch</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">limit</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">DENSE_W</span><span class="p">,</span> <span class="n">SPARSE_W</span><span class="p">,</span> <span class="n">K</span> <span class="o">=</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mi">60</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="s2">&#34;dense&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">client</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">query</span><span class="o">=</span><span class="n">dense_vec</span><span class="p">,</span> <span class="n">using</span><span class="o">=</span><span class="s2">&#34;dense&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">limit</span><span class="o">=</span><span class="n">limit</span><span class="p">,</span> <span class="n">with_payload</span><span class="o">=</span><span class="kc">True</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="s2">&#34;sparse&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="n">client</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">query</span><span class="o">=</span><span class="n">SparseVector</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">indices</span><span class="o">=</span><span class="n">sparse_vec</span><span class="p">[</span><span class="s2">&#34;indices&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="n">values</span><span class="o">=</span><span class="n">sparse_vec</span><span class="p">[</span><span class="s2">&#34;values&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">using</span><span class="o">=</span><span class="s2">&#34;sparse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">limit</span><span class="o">=</span><span class="n">limit</span><span class="p">,</span> <span class="n">with_payload</span><span class="o">=</span><span class="kc">True</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Hybrid: two queries + RRF</span>
</span></span><span class="line"><span class="cl">    <span class="n">dense_res</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">        <span class="n">collection_name</span><span class="o">=</span><span class="n">collection</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">query</span><span class="o">=</span><span class="n">dense_vec</span><span class="p">,</span> <span class="n">using</span><span class="o">=</span><span class="s2">&#34;dense&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="n">limit</span><span class="o">=</span><span class="n">prefetch</span><span class="p">,</span> <span class="n">with_payload</span><span class="o">=</span><span class="kc">True</span>
</span></span><span class="line"><span class="cl">    <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">sparse_vec</span> <span class="ow">and</span> <span class="n">sparse_vec</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&#34;indices&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">sparse_res</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">query_points</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">query</span><span class="o">=</span><span class="n">SparseVector</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">                <span class="n">indices</span><span class="o">=</span><span class="n">sparse_vec</span><span class="p">[</span><span class="s2">&#34;indices&#34;</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">                <span class="n">values</span><span class="o">=</span><span class="n">sparse_vec</span><span class="p">[</span><span class="s2">&#34;values&#34;</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="p">),</span>
</span></span><span class="line"><span class="cl">            <span class="n">using</span><span class="o">=</span><span class="s2">&#34;sparse&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">            <span class="n">limit</span><span class="o">=</span><span class="n">prefetch</span><span class="p">,</span> <span class="n">with_payload</span><span class="o">=</span><span class="kc">True</span>
</span></span><span class="line"><span class="cl">        <span class="p">)</span><span class="o">.</span><span class="n">points</span>
</span></span><span class="line"><span class="cl">    <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">sparse_res</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># RRF fusion</span>
</span></span><span class="line"><span class="cl">    <span class="n">scores</span><span class="p">,</span> <span class="n">payloads</span> <span class="o">=</span> <span class="p">{},</span> <span class="p">{}</span>
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">dense_res</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">rid</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">scores</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">DENSE_W</span> <span class="o">/</span> <span class="p">(</span><span class="n">K</span> <span class="o">+</span> <span class="n">rank</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">payloads</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">payload</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">sparse_res</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">        <span class="n">rid</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">id</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="n">scores</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">scores</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">rid</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">SPARSE_W</span> <span class="o">/</span> <span class="p">(</span><span class="n">K</span> <span class="o">+</span> <span class="n">rank</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="n">rid</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">payloads</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">payloads</span><span class="p">[</span><span class="n">rid</span><span class="p">]</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="n">payload</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">ranked</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">scores</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">limit</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">class</span> <span class="nc">_Result</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">score</span><span class="p">,</span> <span class="n">payload</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">score</span> <span class="o">=</span> <span class="n">score</span>
</span></span><span class="line"><span class="cl">            <span class="bp">self</span><span class="o">.</span><span class="n">payload</span> <span class="o">=</span> <span class="n">payload</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">[</span><span class="n">_Result</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">payloads</span><span class="p">[</span><span class="n">rid</span><span class="p">])</span> <span class="k">for</span> <span class="n">rid</span><span class="p">,</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">ranked</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># ============================================</span>
</span></span><span class="line"><span class="cl"><span class="c1"># CLI</span>
</span></span><span class="line"><span class="cl"><span class="c1"># ============================================</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">&#34;Hybrid search demo&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&#34;query&#34;</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">&#34;Search query&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&#34;--mode&#34;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s2">&#34;hybrid&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                        <span class="n">choices</span><span class="o">=</span><span class="p">[</span><span class="s2">&#34;dense&#34;</span><span class="p">,</span> <span class="s2">&#34;sparse&#34;</span><span class="p">,</span> <span class="s2">&#34;hybrid&#34;</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&#34;--limit&#34;</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">&#34;--collection&#34;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s2">&#34;hybrid_collection&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">client</span> <span class="o">=</span> <span class="n">QdrantClient</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">&#34;localhost&#34;</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">6333</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># BM25 needs to be fitted on your corpus first</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># bm25 = SimpleBM25().fit(your_documents)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># For demo, load pre-fitted model:</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># bm25 = SimpleBM25.load(&#34;bm25_model.json&#34;)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Query: </span><span class="si">{</span><span class="n">args</span><span class="o">.</span><span class="n">query</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Mode:  </span><span class="si">{</span><span class="n">args</span><span class="o">.</span><span class="n">mode</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;---&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># results = hybrid_search(client, bm25, args.collection,</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#                         args.query, args.limit, args.mode)</span>
</span></span><span class="line"><span class="cl">    <span class="n">elapsed</span> <span class="o">=</span> <span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span> <span class="o">*</span> <span class="mi">1000</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># for r in results:</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#     text = r.payload.get(&#34;text&#34;, &#34;&#34;)[:100]</span>
</span></span><span class="line"><span class="cl">    <span class="c1">#     print(f&#34;  [{r.score:.4f}] {text}&#34;)</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># print(f&#34;\n{elapsed:.0f}ms&#34;)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1"># Demo: show tokenization</span>
</span></span><span class="line"><span class="cl">    <span class="n">bm25</span> <span class="o">=</span> <span class="n">SimpleBM25</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">    <span class="n">tokens</span> <span class="o">=</span> <span class="n">bm25</span><span class="o">.</span><span class="n">_tokenize</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">query</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;BM25 tokens: </span><span class="si">{</span><span class="n">tokens</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Stopwords removed: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">FINAL_STOPWORDS</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Protected tokens: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">PROTECTED</span><span class="p">)</span><span class="si">}</span><span class="s2">&#34;</span><span class="p">)</span>
</span></span></code></pre></div><hr>
<h2 id="продакшен-параметры">Production parameters</h2>
<table>
	<thead>
			<tr>
					<th>Parameter</th>
					<th>Value</th>
					<th>Why</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>Dense model</td>
					<td>mxbai-embed-large (1024d)</td>
					<td>Best quality on Russian text (<a href="/posts/rag-02-embeddings/">post 2/N</a>)</td>
			</tr>
			<tr>
					<td>Sparse model</td>
					<td>SimpleBM25 (k1=1.5, b=0.75)</td>
					<td>Standard BM25 parameters, validated on the corpus</td>
			</tr>
			<tr>
					<td>Stemming</td>
					<td>pymorphy3 (lemmatization)</td>
					<td>Russian morphology: &ldquo;настройка/настройки/настроить&rdquo; → one stem</td>
			</tr>
			<tr>
					<td>Stop words</td>
					<td>232 (Russian + English + chat)</td>
					<td>Hand-curated list, not <a href="https://www.nltk.org/">NLTK</a></td>
			</tr>
			<tr>
					<td>Protected tokens</td>
					<td>179</td>
					<td>DevOps commands that must not be filtered out</td>
			</tr>
			<tr>
					<td>BM25 vocabulary</td>
					<td>~125,000 terms</td>
					<td>min_df=2, fitted on 235K+ messages</td>
			</tr>
			<tr>
					<td>RRF K</td>
					<td>60</td>
					<td>Smoothing: all 50 prefetch candidates carry meaningful weight</td>
			</tr>
			<tr>
					<td>Dense weight</td>
					<td>0.7</td>
					<td>Semantics dominate</td>
			</tr>
			<tr>
					<td>Sparse weight</td>
					<td>0.3</td>
					<td>Keywords boost exact matches</td>
			</tr>
			<tr>
					<td>Prefetch</td>
					<td>5x limit (min 50)</td>
					<td>Enough candidates for RRF fusion</td>
			</tr>
			<tr>
					<td>Dense latency</td>
					<td>~50ms</td>
					<td>Ollama, localhost, GPU</td>
			</tr>
			<tr>
					<td>BM25 encode</td>
					<td>~5ms</td>
					<td>In-memory vocab, Python</td>
			</tr>
			<tr>
					<td>Qdrant search</td>
					<td>~15ms + ~10ms</td>
					<td>235K+ points, localhost</td>
			</tr>
			<tr>
					<td>RRF fusion</td>
					<td>&lt;1ms</td>
					<td>Dict merge + sort</td>
			</tr>
			<tr>
					<td><strong>Total</strong></td>
					<td><strong>~80ms</strong></td>
					<td>Negligible compared with LLM inference</td>
			</tr>
			<tr>
					<td>Qdrant collection</td>
					<td>dense (1024d, cosine) + sparse</td>
					<td>One upsert, two vectors</td>
			</tr>
			<tr>
					<td>Corpus</td>
					<td>235,000+ messages</td>
					<td>Work sessions + web</td>
			</tr>
	</tbody>
</table>
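<p>To make the K and weight rows concrete, here is a minimal sketch of the weighted RRF arithmetic. It uses the same formula as <code>hybrid_search</code>: weight / (K + rank), with 1-based ranks. The document IDs and the helper names <code>rrf_contribution</code> and <code>fuse</code> are illustrative assumptions, not part of the original script.</p>

```python
from typing import Dict, List

DENSE_W, SPARSE_W, K = 0.7, 0.3, 60  # values from the table above


def rrf_contribution(weight: float, rank: int, k: int = K) -> float:
    """Weighted RRF term for a 1-based rank."""
    return weight / (k + rank)


def fuse(dense_ids: List[str], sparse_ids: List[str]) -> Dict[str, float]:
    """Sum the weighted RRF terms over both ranked ID lists."""
    scores: Dict[str, float] = {}
    for weight, ids in ((DENSE_W, dense_ids), (SPARSE_W, sparse_ids)):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + rrf_contribution(weight, rank)
    return scores


# Hypothetical IDs: "b" appears in both lists, "c" only in the sparse list.
fused = fuse(dense_ids=["a", "b"], sparse_ids=["b", "c"])
ranking = sorted(fused, key=fused.get, reverse=True)

# Ratio of the rank-50 contribution to the rank-1 contribution.
ratio = rrf_contribution(DENSE_W, 50) / rrf_contribution(DENSE_W, 1)
print(ranking, round(ratio, 3))  # ['b', 'a', 'c'] 0.555
```

<p>Because K=60 flattens the curve, a candidate at rank 50 still receives about 55% of the rank-1 contribution, which is why the whole prefetch window matters. A document found by both retrievers (<code>b</code>) outranks one that only dense search placed first (<code>a</code>).</p>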
<h3 id="эволюция">Evolution</h3>
<table>
	<thead>
			<tr>
					<th>Version</th>
					<th>Method</th>
					<th>Outcome</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td>v1</td>
					<td>Dense only (cosine)</td>
					<td>Misses <code>kubectl get pods</code> for the query &ldquo;получить список подов&rdquo; (&ldquo;get the list of pods&rdquo;)</td>
			</tr>
			<tr>
					<td>v2</td>
					<td>Dense + BM25 + RRF</td>
					<td>Finds matches both by meaning and by keywords</td>
			</tr>
	</tbody>
</table>
<hr>
<h2 id="что-дальше">What's next</h2>
<p>Hybrid search finds relevant fragments. But 10 results are not yet an answer. Still needed:</p>
<ul>
<li><strong>RAG Pipeline 5/N: Reranking</strong>. Hybrid search returns 10 results, but their order may be suboptimal. Cross-encoder reranking re-scores each (query, document) pair and produces the final ranking.</li>
</ul>
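<p>As a rough sketch of where that step would sit: reranking re-scores each (query, document) pair and reorders the candidates. The <code>overlap_score</code> function below is a toy token-overlap stand-in so the example runs without a model. It is not a cross-encoder, and all names and sample texts are hypothetical.</p>

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    docs: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Re-score every (query, document) pair and sort by the new score."""
    scored = [(score_pair(query, doc), doc) for doc in docs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens found in doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens.intersection(d_tokens)) / len(q_tokens)


# Hypothetical candidates, in the order hybrid search returned them.
candidates = [
    "restart the nginx service",
    "how to list pods",
    "kubectl get pods -n prod",
]
reranked = rerank("kubectl get pods", candidates, overlap_score, top_k=2)
print([doc for _, doc in reranked])
# ['kubectl get pods -n prod', 'how to list pods']
```

<p>In a real pipeline <code>score_pair</code> would presumably wrap a cross-encoder model call. The sort-and-truncate logic around it stays the same.</p>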
<hr>
<p>Telegram: <a href="https://t.me/DevITWay">@DevITWay</a>
Website: <a href="https://devopsway.ru/">devopsway.ru</a></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
