org.apache.lucene/lucene-grouping/8.2.0 : org/apache/lucene/search/grouping/package-info.java

org.apache.lucene.search.grouping
http://lucene.apache.org/lucene-parent/lucene-grouping: Lucene Grouping Module (The Apache Software Foundation)
Apache 2
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

Grouping.

This module enables search result grouping with Lucene, where hits
with the same value in the specified single-valued group field are
grouped together.  For example, if you group by the author
field, then all documents with the same value in the author
field fall into a single group.

Grouping requires a number of inputs:

groupField: this is the field used for grouping.
      For example, if you use the author field then each
      group has all books by the same author.  Documents that don't
      have this field are grouped under a single group with
      a null group value.
  
groupSort: how the groups are sorted.  For sorting
      purposes, each group is "represented" by the highest-sorted
      document according to the groupSort within it.  For
      example, if you specify "price" (ascending) then the first group
      is the one with the lowest price book within it.  Or if you
      specify relevance group sort, then the first group is the one
      containing the highest scoring book.
  
topNGroups: how many top groups to keep.  For
      example, 10 means the top 10 groups are computed.
  
groupOffset: which "slice" of top groups you want to
      retrieve.  For example, 3 means you'll get 7 groups back
      (assuming topNGroups is 10).  This is useful for
      paging, where you might show 5 groups per page.
  
withinGroupSort: how the documents within each group
      are sorted.  This can be different from the group sort.
  
maxDocsPerGroup: how many top documents within each
      group to keep.
  
withinGroupOffset: which "slice" of top
      documents you want to retrieve from each group.
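
As a rough illustration of how topNGroups and groupOffset interact when paging, here is a plain-Java sketch. It does not touch Lucene; the class and method names are hypothetical and exist only for this example.

```java
class GroupPagingSketch {
    // The collector keeps at most topNGroups groups; the first groupOffset
    // of those are skipped, so the rest is what the caller gets back.
    static int groupsReturned(int totalGroups, int topNGroups, int groupOffset) {
        int kept = Math.min(totalGroups, topNGroups);
        return Math.max(0, kept - groupOffset);
    }

    // Offset to skip for a zero-based page index.
    static int offsetForPage(int page, int groupsPerPage) {
        return page * groupsPerPage;
    }

    // Number of top groups that must be computed to fill that page.
    static int topNForPage(int page, int groupsPerPage) {
        return offsetForPage(page, groupsPerPage) + groupsPerPage;
    }

    public static void main(String[] args) {
        // The example from the list above: topNGroups = 10, groupOffset = 3.
        System.out.println(groupsReturned(100, 10, 3)); // prints 7
        // Second page (index 1) with 5 groups per page.
        int offset = offsetForPage(1, 5);
        int topN = topNForPage(1, 5);
        System.out.println(offset + " " + topN + " " + groupsReturned(100, topN, offset)); // prints 5 10 5
    }
}
```

With these numbers, the second page asks for the top 10 groups and discards the first 5.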

The implementation is two-pass: the first pass (FirstPassGroupingCollector) gathers the top groups, and the second pass (SecondPassGroupingCollector) gathers documents within those groups. If the search is costly to run you may want to use the CachingCollector class, which caches hits and can (quickly) replay them for the second pass. This way you only run the query once, but you pay a RAM cost to (briefly) hold all hits. Results are returned as a TopGroups instance.
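
To make the two passes concrete, here is a plain-Java sketch over in-memory hits. It does not use Lucene: the Hit class, the method names and the score-descending sort are assumptions made for the illustration, and the real collectors work per segment with priority queues rather than maps. The in-memory list stands in for the cached hits that are replayed for the second pass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassGroupingSketch {
    static final class Hit {
        final String group; // value of the group field, e.g. the author
        final int doc;      // document id
        final float score;  // relevance score
        Hit(String group, int doc, float score) {
            this.group = group;
            this.doc = doc;
            this.score = score;
        }
    }

    // First pass: pick the topNGroups groups, each represented by its best hit.
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> best = new HashMap<>();
        for (Hit h : hits) {
            Float prev = best.get(h.group);
            if (prev == null || h.score > prev) {
                best.put(h.group, h.score);
            }
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> {
            int byScore = Float.compare(best.get(b), best.get(a)); // higher score first
            return byScore != 0 ? byScore : a.compareTo(b);        // ties by group name
        });
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    // Second pass: for the chosen groups only, keep the best maxDocsPerGroup hits.
    static Map<String, List<Hit>> secondPass(List<Hit> hits, List<String> topGroups, int maxDocsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : topGroups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> docs = result.get(h.group);
            if (docs != null) {
                docs.add(h);
            }
        }
        for (List<Hit> docs : result.values()) {
            docs.sort((a, b) -> Float.compare(b.score, a.score));
            if (docs.size() > maxDocsPerGroup) {
                docs.subList(maxDocsPerGroup, docs.size()).clear();
            }
        }
        return result;
    }

    // Made-up sample data for the demonstration.
    static List<Hit> sampleHits() {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("austen", 0, 1.0f));
        hits.add(new Hit("borges", 1, 3.0f));
        hits.add(new Hit("austen", 2, 2.5f));
        hits.add(new Hit("calvino", 3, 0.5f));
        hits.add(new Hit("borges", 4, 2.0f));
        return hits;
    }

    public static void main(String[] args) {
        List<Hit> hits = sampleHits();
        List<String> top = firstPass(hits, 2);
        Map<String, List<Hit>> grouped = secondPass(hits, top, 1);
        for (Map.Entry<String, List<Hit>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> doc " + e.getValue().get(0).doc);
        }
    }
}
```

Running main on the sample data lists borges (doc 1) before austen (doc 2); calvino is dropped in the first pass and never reaches the second.
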
Groups are defined by GroupSelector implementations:
  
    TermGroupSelector groups based on the value of a SortedDocValues field
    ValueSourceGroupSelector groups based on the value of a ValueSource
  
Known limitations:

   Sharding is not directly supported, though it is not too
    difficult to add if you merge the top groups and the top
    documents per group yourself.
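
One way to picture the merge step for top groups: each shard reports its own top groups together with the sort value of their best document, and the coordinator keeps the best value per group and re-sorts. The plain-Java sketch below does this for a score-descending group sort. It is not Lucene code and all names in it are hypothetical; within the module, SearchGroup.merge and TopGroups.merge are the helpers to look at for the real thing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ShardGroupMergeSketch {
    // Each shard contributes a map from group value to the score of the
    // best document it saw for that group.
    static List<String> mergeTopGroups(List<Map<String, Float>> shardTopGroups, int topNGroups) {
        Map<String, Float> best = new HashMap<>();
        for (Map<String, Float> shard : shardTopGroups) {
            for (Map.Entry<String, Float> e : shard.entrySet()) {
                Float prev = best.get(e.getKey());
                if (prev == null || e.getValue() > prev) {
                    best.put(e.getKey(), e.getValue());
                }
            }
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> {
            int byScore = Float.compare(best.get(b), best.get(a)); // higher score first
            return byScore != 0 ? byScore : a.compareTo(b);        // ties by group name
        });
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    // Made-up per-shard results for the demonstration.
    static List<Map<String, Float>> sampleShards() {
        Map<String, Float> shard0 = new HashMap<>();
        shard0.put("austen", 2.5f);
        shard0.put("borges", 1.0f);
        Map<String, Float> shard1 = new HashMap<>();
        shard1.put("borges", 3.0f);
        shard1.put("calvino", 0.5f);
        List<Map<String, Float>> shards = new ArrayList<>();
        shards.add(shard0);
        shards.add(shard1);
        return shards;
    }

    public static void main(String[] args) {
        System.out.println(mergeTopGroups(sampleShards(), 2)); // prints [borges, austen]
    }
}
```

For the merged result to be exact, every shard has to report at least as many groups as the coordinator wants to keep.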

Typical usage of the generic two-pass grouping search through the GroupingSearch convenience utility
  (optionally caching hits for the second pass) looks like this:
  GroupingSearch groupingSearch = new GroupingSearch("author");
  groupingSearch.setGroupSort(groupSort);
  if (useCache) {
    // Sets cache in MB
    groupingSearch.setCachingInMB(4.0, true);
  }
  if (requiredTotalGroupCount) {
    groupingSearch.setAllGroups(true);
  }
  TermQuery query = new TermQuery(new Term("content", searchTerm));
  TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
  // Render result...
  if (requiredTotalGroupCount) {
    int totalGroupCount = result.totalGroupCount;
  }

To use the single-pass BlockGroupingCollector,
   first, at indexing time, you must ensure all docs in each group
   are added as a block, and you have some way to find the last
   document of each group.  One simple way to do this is to add a
   marker binary field:
  // Create Documents from your source:
  List<Document> oneGroup = ...;
  
  // StringField indexes the value as a single token, docs only, without norms:
  Field groupEndField = new StringField("groupEnd", "x", Field.Store.NO);
  oneGroup.get(oneGroup.size()-1).add(groupEndField);
  // You can also use writer.updateDocuments(); just be sure you
  // replace an entire previous doc block with this new one.  For
  // example, each group could have a "groupID" field, with the same
  // value for all docs in this group:
  writer.addDocuments(oneGroup);

Then, at search time, do this up front:
  // Create this once in your app & reuse it across all queries:
  Query groupEndDocs = new TermQuery(new Term("groupEnd", "x"));

Finally, do this per search:
  // Per search (s is the IndexSearcher); the marker query never needs scores:
  Weight groupEndWeight = s.createWeight(s.rewrite(groupEndDocs), ScoreMode.COMPLETE_NO_SCORES, 1f);
  BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndWeight);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  TopGroups<?> groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup);
  // Render groupsResult...

Or alternatively use the GroupingSearch convenience utility:
  // Per search:
  GroupingSearch groupingSearch = new GroupingSearch(groupEndDocs);
  groupingSearch.setGroupSort(groupSort);
  TermQuery query = new TermQuery(new Term("content", searchTerm));
  TopGroups<?> groupsResult = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
  // Render groupsResult...

Note that the groupValue of each GroupDocs
will be null, so if you need to present this value you'll
have to retrieve it separately (for example from stored
fields or doc values).
Another collector is the AllGroupHeadsCollector, which retrieves the most relevant document of every
   group, also known as the group heads. This is useful when you want to compute group-based facets or
   statistics over the complete query result. The collector can be executed during the first or second
   phase. It can also be used through the GroupingSearch convenience utility, but if you only want the
   most relevant document per group it is better to use the collector directly, as shown below.
  TermGroupSelector grouper = new TermGroupSelector(groupField);
  AllGroupHeadsCollector<BytesRef> c = AllGroupHeadsCollector.newCollector(grouper, sortWithinGroup);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  // Return all group heads as int array
  int[] groupHeadsArray = c.retrieveGroupHeads();
  // Return all group heads as FixedBitSet.
  int maxDoc = s.getIndexReader().maxDoc();
  FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);

/** 
 * Grouping.
 * <p>
 * This module enables search result grouping with Lucene, where hits
 * with the same value in the specified single-valued group field are
 * grouped together.  For example, if you group by the <code>author</code>
 * field, then all documents with the same value in the <code>author</code>
 * field fall into a single group.
 * </p>
 * 
 * <p>Grouping requires a number of inputs:</p>
 * 
 * <ul>
 *   <li><code>groupField</code>: this is the field used for grouping.
 *       For example, if you use the <code>author</code> field then each
 *       group has all books by the same author.  Documents that don't
 *       have this field are grouped under a single group with
 *       a <code>null</code> group value.
 * 
 *   <li><code>groupSort</code>: how the groups are sorted.  For sorting
 *       purposes, each group is "represented" by the highest-sorted
 *       document according to the <code>groupSort</code> within it.  For
 *       example, if you specify "price" (ascending) then the first group
 *       is the one with the lowest price book within it.  Or if you
 *       specify relevance group sort, then the first group is the one
 *       containing the highest scoring book.
 * 
 *   <li><code>topNGroups</code>: how many top groups to keep.  For
 *       example, 10 means the top 10 groups are computed.
 * 
 *   <li><code>groupOffset</code>: which "slice" of top groups you want to
 *       retrieve.  For example, 3 means you'll get 7 groups back
 *       (assuming <code>topNGroups</code> is 10).  This is useful for
 *       paging, where you might show 5 groups per page.
 * 
 *   <li><code>withinGroupSort</code>: how the documents within each group
 *       are sorted.  This can be different from the group sort.
 * 
 *   <li><code>maxDocsPerGroup</code>: how many top documents within each
 *       group to keep.
 * 
 *   <li><code>withinGroupOffset</code>: which "slice" of top
 *       documents you want to retrieve from each group.
 * 
 * </ul>
 * 
 * <p>The implementation is two-pass: the first pass ({@link
 *   org.apache.lucene.search.grouping.FirstPassGroupingCollector})
 *   gathers the top groups, and the second pass ({@link
 *   org.apache.lucene.search.grouping.SecondPassGroupingCollector})
 *   gathers documents within those groups.  If the search is costly to
 *   run you may want to use the {@link
 *   org.apache.lucene.search.CachingCollector} class, which
 *   caches hits and can (quickly) replay them for the second pass.  This
 *   way you only run the query once, but you pay a RAM cost to (briefly)
 *   hold all hits.  Results are returned as a {@link
 *   org.apache.lucene.search.grouping.TopGroups} instance.</p>
 * 
 * <p>Groups are defined by {@link org.apache.lucene.search.grouping.GroupSelector}
 *   implementations:</p>
 *   <ul>
 *     <li>{@link org.apache.lucene.search.grouping.TermGroupSelector} groups based on
 *     the value of a {@link org.apache.lucene.index.SortedDocValues} field</li>
 *     <li>{@link org.apache.lucene.search.grouping.ValueSourceGroupSelector} groups based on
 *     the value of a {@link org.apache.lucene.queries.function.ValueSource}</li>
 *   </ul>
 * 
 * <p>Known limitations:</p>
 * <ul>
 *   <li> Sharding is not directly supported, though it is not too
 *     difficult to add if you merge the top groups and the top
 *     documents per group yourself.
 * </ul>
 * 
 * <p>Typical usage of the generic two-pass grouping search through the <code>GroupingSearch</code> convenience
 *   utility (optionally caching hits for the second pass) looks like this:</p>
 * 
 * <pre class="prettyprint">
 *   GroupingSearch groupingSearch = new GroupingSearch("author");
 *   groupingSearch.setGroupSort(groupSort);
 * 
 *   if (useCache) {
 *     // Sets cache in MB
 *     groupingSearch.setCachingInMB(4.0, true);
 *   }
 * 
 *   if (requiredTotalGroupCount) {
 *     groupingSearch.setAllGroups(true);
 *   }
 * 
 *   TermQuery query = new TermQuery(new Term("content", searchTerm));
 *   TopGroups&lt;BytesRef&gt; result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
 * 
 *   // Render result...
 *   if (requiredTotalGroupCount) {
 *     int totalGroupCount = result.totalGroupCount;
 *   }
 * </pre>
 * 
 * <p>To use the single-pass <code>BlockGroupingCollector</code>,
 *    first, at indexing time, you must ensure all docs in each group
 *    are added as a block, and you have some way to find the last
 *    document of each group.  One simple way to do this is to add a
 *    marker binary field:</p>
 * 
 * <pre class="prettyprint">
 *   // Create Documents from your source:
 *   List&lt;Document&gt; oneGroup = ...;
 *   
 *   // StringField indexes the value as a single token, docs only, without norms:
 *   Field groupEndField = new StringField("groupEnd", "x", Field.Store.NO);
 *   oneGroup.get(oneGroup.size()-1).add(groupEndField);
 * 
 *   // You can also use writer.updateDocuments(); just be sure you
 *   // replace an entire previous doc block with this new one.  For
 *   // example, each group could have a "groupID" field, with the same
 *   // value for all docs in this group:
 *   writer.addDocuments(oneGroup);
 * </pre>
 * 
 * Then, at search time, do this up front:
 * 
 * <pre class="prettyprint">
 *   // Create this once in your app &amp; reuse it across all queries:
 *   Query groupEndDocs = new TermQuery(new Term("groupEnd", "x"));
 * </pre>
 * 
 * Finally, do this per search:
 * 
 * <pre class="prettyprint">
 *   // Per search (s is the IndexSearcher); the marker query never needs scores:
 *   Weight groupEndWeight = s.createWeight(s.rewrite(groupEndDocs), ScoreMode.COMPLETE_NO_SCORES, 1f);
 *   BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndWeight);
 *   s.search(new TermQuery(new Term("content", searchTerm)), c);
 *   TopGroups&lt;?&gt; groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup);
 * 
 *   // Render groupsResult...
 * </pre>
 * 
 * Or alternatively use the <code>GroupingSearch</code> convenience utility:
 * 
 * <pre class="prettyprint">
 *   // Per search:
 *   GroupingSearch groupingSearch = new GroupingSearch(groupEndDocs);
 *   groupingSearch.setGroupSort(groupSort);
 *   TermQuery query = new TermQuery(new Term("content", searchTerm));
 *   TopGroups&lt;?&gt; groupsResult = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
 *
 *   // Render groupsResult...
 * </pre>
 * 
 * Note that the <code>groupValue</code> of each <code>GroupDocs</code>
 * will be <code>null</code>, so if you need to present this value you'll
 * have to retrieve it separately (for example from stored
 * fields or doc values).
 * 
 * <p>Another collector is the <code>AllGroupHeadsCollector</code>, which retrieves the most relevant document of every
 *    group, also known as the group heads. This is useful when you want to compute group-based facets or
 *    statistics over the complete query result. The collector can be executed during the first or second
 *    phase. It can also be used through the <code>GroupingSearch</code> convenience utility, but if you only want the
 *    most relevant document per group it is better to use the collector directly, as shown below.</p>
 * 
 * <pre class="prettyprint">
 *   TermGroupSelector grouper = new TermGroupSelector(groupField);
 *   AllGroupHeadsCollector&lt;BytesRef&gt; c = AllGroupHeadsCollector.newCollector(grouper, sortWithinGroup);
 *   s.search(new TermQuery(new Term("content", searchTerm)), c);
 *   // Return all group heads as int array
 *   int[] groupHeadsArray = c.retrieveGroupHeads();
 *   // Return all group heads as FixedBitSet.
 *   int maxDoc = s.getIndexReader().maxDoc();
 *   FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);
 * </pre>
 *
 */
package org.apache.lucene.search.grouping;