1. 什么是 Lucene?

Lucene 是一个开放源代码项目,自问世以来引发了开源社群的巨大反响。程序员们不仅使用它构建具体的全文检索应用,还将其集成到各种系统软件及 Web 应用中。甚至某些商业软件也采用 Lucene 作为其内部全文检索子系统的核心。例如,Apache 软件基金会的网站使用 Lucene 作为全文检索引擎,IBM 的开源软件 Eclipse 2.1 版本中也采用了 Lucene 作为帮助子系统的全文索引引擎,相应的 IBM 商业软件 WebSphere 中同样采用了 Lucene。

Lucene 以其开放源代码的特性、优异的索引结构以及良好的系统架构,获得了越来越多的应用。作为一个全文检索引擎,它具有如下突出优点:

  1. 索引文件格式独立于应用平台:Lucene 定义了一套以 8 位字节为基础的索引文件格式,使得兼容系统或者不同平台的应用能够共享建立的索引文件。
  2. 分块索引机制:在传统全文检索引擎的倒排索引基础上,实现了分块索引。能够针对新的文件建立小文件索引,提升索引速度,然后通过与原有索引的合并达到优化目的。
  3. 优秀的面向对象架构:使得对于 Lucene 扩展的学习难度降低,方便扩充新功能。
  4. 独立的文本分析接口:设计了独立于语言和文件格式的文本分析接口,索引器通过接受 Token 流完成索引文件的创立。用户扩展新的语言和文件格式,只需要实现文本分析的接口。
  5. 强大的默认查询引擎:用户无需自己编写代码即可获得强大的查询能力。Lucene 的查询实现中默认实现了布尔操作、模糊查询(Fuzzy Search)、分组查询等等。

2. 什么是 Solr?

2.1 为什么需要 Solr?

Solr 是将整个索引操作功能封装好的搜索引擎系统(企业级搜索引擎产品)。引入 Solr 主要基于以下考量:

  • 降低业务系统负载:Solr 可以部署到单独的服务器上(Web 服务),提供服务。业务系统只需发送请求、接收响应即可。
  • 突破存储限制:Solr 部署在专门的服务器上,其索引库不会受业务系统服务器存储空间的限制。
  • 支持分布式集群:索引服务的容量和能力可以线性扩展。

2.2 Solr 的工作机制

Solr 是在 Lucene 工具包的基础之上进行了封装,并以 Web 服务的形式对外提供索引功能。

  • 索引操作:业务系统需要使用索引功能(建索引、查索引)时,只要发出 HTTP 请求,并将返回数据进行解析即可。
  • 技术栈:Solr 是 Apache 下的一个顶级开源项目,采用 Java 开发,是基于 Lucene 的全文搜索服务器。它提供了比 Lucene 更为丰富的查询语言,同时实现了可配置、可扩展,并对索引、搜索性能进行了优化。
  • 运行环境:Solr 可以独立运行,也可以运行在 Jetty、Tomcat 等 Servlet 容器中。
  • 交互方式

    • 索引:实现方法很简单,用 POST 方法向 Solr 服务器发送一个描述 Field 及其内容的 XML 文档,Solr 根据 XML 文档添加、删除、更新索引。
    • 搜索:只需要发送 HTTP GET 请求,然后对 Solr 返回 XML、JSON 等格式的查询结果进行解析,组织页面布局。
  • 管理界面:Solr 不提供构建 UI 的功能,但提供了一个管理界面。通过管理界面可以查询 Solr 的配置和运行情况。

3. Lucene 和 Solr 的关系

Solr 是门户,Lucene 是底层基础。Solr 和 Lucene 的关系正如 Hadoop 和 HDFS 的关系。

4. Jetty 是什么?

Jetty 是一个开源的 Servlet 容器,它为基于 Java 的 Web 容器(例如 JSP 和 Servlet)提供运行环境。Jetty 是使用 Java 语言编写的,它的 API 以一组 JAR 包的形式发布。开发人员可以将 Jetty 容器实例化成一个对象,迅速为一些独立运行(Stand-alone)的 Java 应用提供网络和 Web 连接。

5. 流程概况

6. Jetty 接收请求并处理

设置本地调试方法见 lucene-solr 本地调试方法

6.1 启动代码示例

StartSolrJetty.java 核心代码如下:

public static void main(String[] args) {
    // System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();
    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
    // Set some timeout options to make debugging easier.
    connector.setIdleTimeout(1000 * 60 * 60);
    connector.setSoLingerTime(-1);
    connector.setPort(8983);
    server.setConnectors(new Connector[] { connector });
    
    WebAppContext bb = new WebAppContext();
    bb.setServer(server);
    bb.setContextPath("/solr");
    bb.setWar("solr/webapp/web");

    //    // START JMX SERVER
    //    if( true ) {
    //      MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
    //      MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
    //      server.getContainer().addEventListener(mBeanContainer);
    //      mBeanContainer.start();
    //    }
    
    server.setHandler(bb);

    try {
        System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
        server.start();
        while (System.in.available() == 0) {
            Thread.sleep(5000);
        }
        server.stop();
        server.join();
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(100);
    }
}

6.2 核心组件说明

其中,Server 是 HTTP 服务器,聚合了 Connector(HTTP 请求接收器)和请求处理器 HandlerServer 本身是一个 Handler 和一个线程池,Connector 使用线程池来调用 handle 方法。

/** 
 * Jetty HTTP Servlet Server.
 * This class is the main class for the Jetty HTTP Servlet server.
 * It aggregates Connectors (HTTP request receivers) and request Handlers.
 * The server is itself a handler and a ThreadPool. Connectors use the ThreadPool methods
 * to run jobs that will eventually call the handle method.
 */

其工作流程如下图所示:

因其不是本文重点,故略去不述。

7. Solr 调用 Lucene 过程

上篇文章 Solr 调用 Lucene 底层实现倒排索引源码解析 已经论述,可对照上面的整体流程图进行解读,故此处略去不述。

8. Lucene 调用过程

从上图可以看出,整个过程主要分为两个阶段。

8.1 创建 Weight

8.1.1 创建 BooleanWeight

// BooleanWeight.java
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    for (BooleanClause c : query) {
        Query q = c.getQuery();
        Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);
        weights.add(w);
    }
}

8.1.2 同义词权重分析

// SynonymQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    if (needsScores) {
        return new SynonymWeight(this, searcher, boost);
    } else {
        // if scores are not needed, let BooleanWeight deal with optimizing that case.
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        for (Term term : terms) {
            bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
        }
        return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);
    }
}

8.1.3 TermQuery 与 TermWeight

// TermQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
            || perReaderTermState.wasBuiltFor(context) == false) {
        if (needsScores) {
            // make TermQuery single-pass if we don't have a PRTS or if the context
            // differs!
            termState = TermContext.build(context, term);
        } else {
            // do not compute the term state, this will help save seeks in the terms
            // dict on segments that have a cache entry for this query
            termState = null;
        }
    } else {
        // PRTS was pre-build for this IS
        termState = this.perReaderTermState;
    }

    return new TermWeight(searcher, needsScores, boost, termState);
}

调用 TermWeight,计算 CollectionStatisticsTermStatistics

public TermWeight(IndexSearcher searcher, boolean needsScores,
    float boost, TermContext termStates) throws IOException {
    super(TermQuery.this);
    if (needsScores && termStates == null) {
        throw new IllegalStateException("termStates are required when scores are needed");
    }
    this.needsScores = needsScores;
    this.termStates = termStates;
    this.similarity = searcher.getSimilarity(needsScores);

    final CollectionStatistics collectionStats;
    final TermStatistics termStats;
    if (needsScores) {
        termStates.setQuery(this.getQuery().getKeyword());
        collectionStats = searcher.collectionStatistics(term.field());
        termStats = searcher.termStatistics(term, termStates);
    } else {
        // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1
        final int maxDoc = searcher.getIndexReader().maxDoc();
        collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
        termStats = new TermStatistics(term.bytes(), maxDoc, -1, term.bytes());
    }
   
    this.stats = similarity.computeWeight(boost, collectionStats, termStats);
}

调用 SimilaritycomputeWeight(以 BM25 为例):

// BM25Similarity.java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}

8.2 查询过程

完整过程如下:IndexSearcher 调用 search 方法。

protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
    throws IOException {

    // TODO: should we make this threaded...? the Collector could be sync'd?
    // always use single thread:
    for (LeafReaderContext ctx : leaves) { // search each subreader
        final LeafCollector leafCollector;
        try {
            leafCollector = collector.getLeafCollector(ctx); // 1
        } catch (CollectionTerminatedException e) {
            // there is no doc of interest in this reader context
            // continue with the following leaf
            continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx); // 2
        if (scorer != null) {
            try {
                scorer.score(leafCollector, ctx.reader().getLiveDocs()); // 3
            } catch (CollectionTerminatedException e) {
                // collection was terminated prematurely
                // continue with the following leaf
            }
        }
    }
}

8.2.1 获取 Collector

// TopScoreDocCollector.java#SimpleTopScoreDocCollector
@Override
public LeafCollector getLeafCollector(LeafReaderContext context)
    throws IOException {
    final int docBase = context.docBase;
    return new ScorerLeafCollector() {

        @Override
        public void collect(int doc) throws IOException {
            float score = scorer.score();
            /* Document document=context.reader().document(doc); */
       
            // This collector cannot handle these scores:
            assert score != Float.NEGATIVE_INFINITY;
            assert !Float.isNaN(score);

            totalHits++;
            if (score <= pqTop.score) {
                // Since docs are returned in-order (i.e., increasing doc Id), a document
                // with equal score to pqTop.score cannot compete since HitQueue favors
                // documents with lower doc Ids. Therefore reject those docs too.
                return;
            }
            pqTop.doc = doc + docBase;
            pqTop.score = score;
            pqTop = pq.updateTop();
        }
    };
}

8.2.2 调用打分 Score

/**
 * Optional method, to return a {@link BulkScorer} to
 * score the query and send hits to a {@link Collector}.
 * Only queries that have a different top-level approach
 * need to override this; the default implementation
 * pulls a normal {@link Scorer} and iterates and
 * collects the resulting hits which are not marked as deleted.
 *
 * @param context the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.
 * @return a {@link BulkScorer} which scores documents and passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
    Scorer scorer = scorer(context);
    if (scorer == null) {
        // No docs match
        return null;
    }

    // This impl always scores docs in order, so we can ignore scoreDocsInOrder:
    return new DefaultBulkScorer(scorer);
}

/** Just wraps a Scorer and performs top scoring using it. */
protected static class DefaultBulkScorer extends BulkScorer {
    private final Scorer scorer;
    private final DocIdSetIterator iterator;
    private final TwoPhaseIterator twoPhase;

    /** Sole constructor. */
    public DefaultBulkScorer(Scorer scorer) {
        if (scorer == null) {
            throw new NullPointerException();
        }
        this.scorer = scorer;
        this.iterator = scorer.iterator();
        this.twoPhase = scorer.twoPhaseIterator();
    }

    @Override
    public long cost() {
        return iterator.cost();
    }

    @Override
    public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
        collector.setScorer(scorer);
        if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
            scoreAll(collector, iterator, twoPhase, acceptDocs);
            return DocIdSetIterator.NO_MORE_DOCS;
        } else {
            int doc = scorer.docID();
            if (doc < min) {
                if (twoPhase == null) {
                    doc = iterator.advance(min);
                } else {
                    doc = twoPhase.approximation().advance(min);
                }
            }
            return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
        }
    }
}

调用 scoreAll 方法,遍历 Document,执行 SimpleTopScoreDocCollectorcollect 方法,包含打分逻辑(见上文 SimpleTopScoreDocCollector 代码)。

/** 
 * Specialized method to bulk-score all hits; we separate this from {@link #scoreRange} 
 * to help out hotspot.
 * See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> 
 */
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
    if (twoPhase == null) {
        for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
            if (acceptDocs == null || acceptDocs.get(doc)) {
                collector.collect(doc);
            }
        }
    } else {
        // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm
        final DocIdSetIterator approximation = twoPhase.approximation();
        for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
            if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
                collector.collect(doc);
            }
        }
    }
}

总结

整个查询流程涉及多个组件的协同工作,从 Jetty 接收请求到 Solr 封装,再到 Lucene 底层的 Weight 创建与评分计算,梳理起来较为复杂。希望本文的源码解析能帮助大家深入理解 Solr 查询的工作原理。

参考资料

  1. http://www.blogjava.net/hoojo/archive/2012/09/06/387140.html
  2. https://baike.baidu.com/item/jetty/370234?fr=aladdin

说明:本文基于 Solr 与 Lucene 较早版本(大致对应 Lucene 6.x/7.x 时期)的源码进行分析。Lucene 8.x 及后续版本在 BulkScorerSimilarity 等核心接口上有所变更,但整体查询流程架构依然具有参考价值。