solr查询工作原理深入内幕

1. 什么是 Lucene？

Lucene 是一个开放源代码项目，自问世以来引发了开源社群的巨大反响。程序员们不仅使用它构建具体的全文检索应用，还将其集成到各种系统软件及 Web 应用中。甚至某些商业软件也采用 Lucene 作为其内部全文检索子系统的核心。例如，Apache 软件基金会的网站使用 Lucene 作为全文检索引擎，IBM 的开源软件 Eclipse 2.1 版本中也采用了 Lucene 作为帮助子系统的全文索引引擎，相应的 IBM 商业软件 WebSphere 中同样采用了 Lucene。

Lucene 以其开放源代码的特性、优异的索引结构以及良好的系统架构，获得了越来越多的应用。作为一个全文检索引擎，它具有如下突出优点：

索引文件格式独立于应用平台：Lucene 定义了一套以 8 位字节为基础的索引文件格式，使得兼容系统或者不同平台的应用能够共享建立的索引文件。
分块索引机制：在传统全文检索引擎的倒排索引基础上，实现了分块索引。能够针对新的文件建立小文件索引，提升索引速度，然后通过与原有索引的合并达到优化目的。
优秀的面向对象架构：使得对于 Lucene 扩展的学习难度降低，方便扩充新功能。
独立的文本分析接口：设计了独立于语言和文件格式的文本分析接口，索引器通过接受 Token 流完成索引文件的创立。用户扩展新的语言和文件格式，只需要实现文本分析的接口。
强大的默认查询引擎：用户无需自己编写代码即可获得强大的查询能力。Lucene 的查询实现中默认实现了布尔操作、模糊查询（Fuzzy Search）、分组查询等等。

2. 什么是 Solr？

2.1 为什么需要 Solr？

Solr 是将整个索引操作功能封装好的搜索引擎系统（企业级搜索引擎产品）。引入 Solr 主要基于以下考量：

降低业务系统负载：Solr 可以部署到单独的服务器上（Web 服务），提供服务。业务系统只需发送请求、接收响应即可。
突破存储限制：Solr 部署在专门的服务器上，其索引库不会受业务系统服务器存储空间的限制。
支持分布式集群：索引服务的容量和能力可以线性扩展。

2.2 Solr 的工作机制

Solr 是在 Lucene 工具包的基础之上进行了封装，并以 Web 服务的形式对外提供索引功能。

索引操作：业务系统需要使用索引功能（建索引、查索引）时，只要发出 HTTP 请求，并将返回数据进行解析即可。
技术栈：Solr 是 Apache 下的一个顶级开源项目，采用 Java 开发，是基于 Lucene 的全文搜索服务器。它提供了比 Lucene 更为丰富的查询语言，同时实现了可配置、可扩展，并对索引、搜索性能进行了优化。
运行环境：Solr 可以独立运行，也可以运行在 Jetty、Tomcat 等 Servlet 容器中。
交互方式：
- 索引：实现方法很简单，用 POST 方法向 Solr 服务器发送一个描述 Field 及其内容的 XML 文档，Solr 根据 XML 文档添加、删除、更新索引。
- 搜索：只需要发送 HTTP GET 请求，然后对 Solr 返回 XML、JSON 等格式的查询结果进行解析，组织页面布局。
管理界面：Solr 不提供构建 UI 的功能，但提供了一个管理界面。通过管理界面可以查询 Solr 的配置和运行情况。

3. Lucene 和 Solr 的关系

Solr 是门户，Lucene 是底层基础。Solr 和 Lucene 的关系正如 Hadoop 和 HDFS 的关系。

4. Jetty 是什么？

Jetty 是一个开源的 Servlet 容器，它为基于 Java 的 Web 容器（例如 JSP 和 Servlet）提供运行环境。Jetty 是使用 Java 语言编写的，它的 API 以一组 JAR 包的形式发布。开发人员可以将 Jetty 容器实例化成一个对象，迅速为一些独立运行（Stand-alone）的 Java 应用提供网络和 Web 连接。

5. 流程概况

6. Jetty 接收请求并处理

设置本地调试方法见 lucene-solr 本地调试方法。

6.1 启动代码示例

StartSolrJetty.java 核心代码如下：

public static void main(String[] args) {
    // System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();
    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
    // Set some timeout options to make debugging easier.
    connector.setIdleTimeout(1000 * 60 * 60);
    connector.setSoLingerTime(-1);
    connector.setPort(8983);
    server.setConnectors(new Connector[] { connector });
    
    WebAppContext bb = new WebAppContext();
    bb.setServer(server);
    bb.setContextPath("/solr");
    bb.setWar("solr/webapp/web");

    //    // START JMX SERVER
    //    if( true ) {
    //      MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
    //      MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
    //      server.getContainer().addEventListener(mBeanContainer);
    //      mBeanContainer.start();
    //    }
    
    server.setHandler(bb);

    try {
        System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
        server.start();
        while (System.in.available() == 0) {
            Thread.sleep(5000);
        }
        server.stop();
        server.join();
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(100);
    }
}

6.2 核心组件说明

其中，Server 是 HTTP 服务器，聚合了 Connector（HTTP 请求接收器）和请求处理器 Handler。Server 本身是一个 Handler 和一个线程池，Connector 使用线程池来调用 handle 方法。

/** 
 * Jetty HTTP Servlet Server.
 * This class is the main class for the Jetty HTTP Servlet server.
 * It aggregates Connectors (HTTP request receivers) and request Handlers.
 * The server is itself a handler and a ThreadPool. Connectors use the ThreadPool methods
 * to run jobs that will eventually call the handle method.
 */

其工作流程如下图所示：

因其不是本文重点，故略去不述。

7. Solr 调用 Lucene 过程

上篇文章 Solr 调用 Lucene 底层实现倒排索引源码解析已经论述，可对照上面的整体流程图进行解读，故此处略去不述。

8. Lucene 调用过程

从上图可以看出，整个过程主要分为两个阶段。

8.1 创建 Weight

8.1.1 创建 BooleanWeight

// BooleanWeight.java
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    for (BooleanClause c : query) {
        Query q = c.getQuery();
        Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);
        weights.add(w);
    }
}

8.1.2 同义词权重分析

// SynonymQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    if (needsScores) {
        return new SynonymWeight(this, searcher, boost);
    } else {
        // if scores are not needed, let BooleanWeight deal with optimizing that case.
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (Term term : terms) {
            bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
        }
        return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);
    }
}

8.1.3 TermQuery 与 TermWeight

// TermQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
            || perReaderTermState.wasBuiltFor(context) == false) {
        if (needsScores) {
            // make TermQuery single-pass if we don't have a PRTS or if the context
            // differs!
            termState = TermContext.build(context, term);
        } else {
            // do not compute the term state, this will help save seeks in the terms
            // dict on segments that have a cache entry for this query
            termState = null;
        }
    } else {
        // PRTS was pre-build for this IS
        termState = this.perReaderTermState;
    }

    return new TermWeight(searcher, needsScores, boost, termState);
}

调用 TermWeight，计算 CollectionStatistics 和 TermStatistics：

public TermWeight(IndexSearcher searcher, boolean needsScores,
    float boost, TermContext termStates) throws IOException {
    super(TermQuery.this);
    if (needsScores && termStates == null) {
        throw new IllegalStateException("termStates are required when scores are needed");
    }
    this.needsScores = needsScores;
    this.termStates = termStates;
    this.similarity = searcher.getSimilarity(needsScores);

    final CollectionStatistics collectionStats;
    final TermStatistics termStats;
    if (needsScores) {
        termStates.setQuery(this.getQuery().getKeyword());
        collectionStats = searcher.collectionStatistics(term.field());
        termStats = searcher.termStatistics(term, termStates);
    } else {
        // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1
        final int maxDoc = searcher.getIndexReader().maxDoc();
        collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
        termStats = new TermStatistics(term.bytes(), maxDoc, -1, term.bytes());
    }
   
    this.stats = similarity.computeWeight(boost, collectionStats, termStats);
}

调用 Similarity 的 computeWeight（以 BM25 为例）：

// BM25Similarity.java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}

8.2 查询过程

完整过程如下：IndexSearcher 调用 search 方法。

protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
    throws IOException {

    // TODO: should we make this threaded...? the Collector could be sync'd?
    // always use single thread:
    for (LeafReaderContext ctx : leaves) { // search each subreader
        final LeafCollector leafCollector;
        try {
            leafCollector = collector.getLeafCollector(ctx); // 1
        } catch (CollectionTerminatedException e) {
            // there is no doc of interest in this reader context
            // continue with the following leaf
            continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx); // 2
        if (scorer != null) {
            try {
                scorer.score(leafCollector, ctx.reader().getLiveDocs()); // 3
            } catch (CollectionTerminatedException e) {
                // collection was terminated prematurely
                // continue with the following leaf
            }
        }
    }
}

8.2.1 获取 Collector

// TopScoreDocCollector.java#SimpleTopScoreDocCollector
@Override
public LeafCollector getLeafCollector(LeafReaderContext context)
    throws IOException {
    final int docBase = context.docBase;
    return new ScorerLeafCollector() {

        @Override
        public void collect(int doc) throws IOException {
            float score = scorer.score();
            /* Document document=context.reader().document(doc); */
       
            // This collector cannot handle these scores:
            assert score != Float.NEGATIVE_INFINITY;
            assert !Float.isNaN(score);

            totalHits++;
            if (score <= pqTop.score) {
                // Since docs are returned in-order (i.e., increasing doc Id), a document
                // with equal score to pqTop.score cannot compete since HitQueue favors
                // documents with lower doc Ids. Therefore reject those docs too.
                return;
            }
            pqTop.doc = doc + docBase;
            pqTop.score = score;
            pqTop = pq.updateTop();
        }
    };
}

8.2.2 调用打分 Score

/**
 * Optional method, to return a {@link BulkScorer} to
 * score the query and send hits to a {@link Collector}.
 * Only queries that have a different top-level approach
 * need to override this; the default implementation
 * pulls a normal {@link Scorer} and iterates and
 * collects the resulting hits which are not marked as deleted.
 *
 * @param context the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.
 * @return a {@link BulkScorer} which scores documents and passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    Scorer scorer = scorer(context);
    if (scorer == null) {
        // No docs match
        return null;
    }

    // This impl always scores docs in order, so we can ignore scoreDocsInOrder:
    return new DefaultBulkScorer(scorer);
}

/** Just wraps a Scorer and performs top scoring using it. */
protected static class DefaultBulkScorer extends BulkScorer {
    private final Scorer scorer;
    private final DocIdSetIterator iterator;
    private final TwoPhaseIterator twoPhase;

    /** Sole constructor. */
    public DefaultBulkScorer(Scorer scorer) {
        if (scorer == null) {
            throw new NullPointerException();
        }
        this.scorer = scorer;
        this.iterator = scorer.iterator();
        this.twoPhase = scorer.twoPhaseIterator();
    }

    @Override
    public long cost() {
        return iterator.cost();
    }

    @Override
    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
        collector.setScorer(scorer);
        if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
            scoreAll(collector, iterator, twoPhase, acceptDocs);
            return DocIdSetIterator.NO_MORE_DOCS;
        } else {
            int doc = scorer.docID();
            if (doc < min) {
                if (twoPhase == null) {
                    doc = iterator.advance(min);
                } else {
                    doc = twoPhase.approximation().advance(min);
                }
            }
            return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
        }
    }
}

调用 scoreAll 方法，遍历 Document，执行 SimpleTopScoreDocCollector 的 collect 方法，包含打分逻辑（见上文 SimpleTopScoreDocCollector 代码）。

/** 
 * Specialized method to bulk-score all hits; we separate this from {@link #scoreRange} 
 * to help out hotspot.
 * See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> 
 */
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
    if (twoPhase == null) {
        for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
            if (acceptDocs == null || acceptDocs.get(doc)) {
                collector.collect(doc);
            }
        }
    } else {
        // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm
        final DocIdSetIterator approximation = twoPhase.approximation();
        for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
            if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
                collector.collect(doc);
            }
        }
    }
}

总结

整个查询流程涉及多个组件的协同工作，从 Jetty 接收请求到 Solr 封装，再到 Lucene 底层的 Weight 创建与评分计算，梳理起来较为复杂。希望本文的源码解析能帮助大家深入理解 Solr 查询的工作原理。

参考资料

说明：本文基于 Solr 与 Lucene 较早版本（大致对应 Lucene 6.x/7.x 时期）的源码进行分析。Lucene 8.x 及后续版本在 BulkScorer、Similarity 等核心接口上有所变更，但整体查询流程架构依然具有参考价值。

本文地址：https://1diff.fun/archives/solr-cha-xun-gong-zuo-yuan-li-shen-ru-nei-mu.html

如果对本文有什么问题或疑问都可以在评论区留言，我看到后会尽量解答。