Faceted Search with Lucene 4

Faceted search is a technique used by many e-commerce websites and search engines to let users refine their search results by narrowing the scope of their queries to a category or subcategory.

[Images: facets on Amazon and on eBay]

The facet implementation in Lucene lets us assign documents to categories and subcategories, retrieve the categories of the documents matching a query, and drill down to a specific category or subcategory.

In this post, we are going to write three programs:

  • an indexer
  • a searcher
  • an advanced searcher that narrows down the scope to a category or subcategory


You can find the source code at: https://github.com/fredang/facet-lucene-example

To fetch the project on your machine:

git clone https://github.com/fredang/facet-lucene-example.git

Data

Let’s say we have several books stored in JSON format. Each book is composed of an id, a title, one or several authors, and a category:

[
	{
		"id": 1,
		"title": "a funny book",
		"authors": ["Jean Bon", "Alex Terieur"],
		"book_category": "/novel/comedy"
	},
	{
		"id": 2,
		"title": "a dramatic book",
		"authors": ["Alex Terieur"],
		"book_category": "/novel/drama"
	},
	{
		"id": 3,
		"title": "A hilarious book",
		"authors": ["Marc Assin", "Harry Covert"],
		"book_category": "/book/comedy"
	},
	{
		"id": 4,
		"title": "A sad story",
		"authors": ["Gerard Menvusa", "Alex Terieur"],
		"book_category": "/novel/drama"
	},
	{
		"id": 5,
		"title": "A very sad story",
		"authors": ["Gerard Menvusa", "Alain Terieur"],
		"book_category": "/novel/tragedy"
	}
]

Indexer

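// Imports assume Lucene 4.0, Apache Commons IO and org.json on the classpath
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.facet.index.CategoryDocumentBuilder;
import org.apache.lucene.facet.index.params.DefaultFacetIndexingParams;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.Version;
import org.json.JSONArray;
import org.json.JSONObject;

class FacetLuceneIndexer {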
	private static Version LUCENE_VERSION = Version.LUCENE_40;
	public static void main(String args[]) throws Exception {
		if (args.length != 3) {
			System.err.println("Parameters: [index directory] [taxonomy directory] [json file]");
			System.exit(1);
		}

		String indexDirectory = args[0];
		String taxonomyDirectory = args[1];
		String jsonFileName = args[2];

		IndexWriterConfig writerConfig = new IndexWriterConfig(LUCENE_VERSION, new WhitespaceAnalyzer(LUCENE_VERSION));
		writerConfig.setOpenMode(OpenMode.CREATE);
		IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File(indexDirectory)), writerConfig);

		TaxonomyWriter taxonomyWriter = new DirectoryTaxonomyWriter(new MMapDirectory(new File(taxonomyDirectory)), OpenMode.CREATE);
		CategoryDocumentBuilder categoryDocumentBuilder = new CategoryDocumentBuilder(taxonomyWriter, new DefaultFacetIndexingParams());

		String content = IOUtils.toString(new FileInputStream(jsonFileName));
		JSONArray bookArray = new JSONArray(content);

		Field idField = new IntField("id", 0, Store.YES);
		Field titleField = new TextField("title", "", Store.YES);
		Field authorsField = new TextField("authors", "", Store.YES);
		Field bookCategoryField = new TextField("book_category", "", Store.YES);

		for(int i = 0 ; i < bookArray.length() ; i++) {
			Document document = new Document();

			JSONObject book = bookArray.getJSONObject(i);
			int id = book.getInt("id");
			String title = book.getString("title");
			String bookCategory = book.getString("book_category");

			List<CategoryPath> categoryPaths = new ArrayList<CategoryPath>();
			String authorsString = "";
			JSONArray authors = book.getJSONArray("authors");
			for(int j = 0 ; j < authors.length() ; j++) {
				String author = authors.getString(j);
				if (j > 0) {
					authorsString += ", ";
				}
				categoryPaths.add(new CategoryPath("author", author));
				authorsString += author;
			}
			categoryPaths.add(new CategoryPath("book_category" + bookCategory, '/'));
			categoryDocumentBuilder.setCategoryPaths(categoryPaths);
			categoryDocumentBuilder.build(document);

			idField.setIntValue(id);
			titleField.setStringValue(title);
			authorsField.setStringValue(authorsString);
			bookCategoryField.setStringValue(bookCategory);

			document.add(idField);
			document.add(titleField);
			document.add(authorsField);
			document.add(bookCategoryField);

			indexWriter.addDocument(document);

			System.out.printf("Book: id=%d, title=%s, book_category=%s, authors=%s\n",
				id, title, bookCategory, authors);
		}
		taxonomyWriter.commit();
		taxonomyWriter.close();

		indexWriter.commit();
		indexWriter.close();
	}
}

To index the categories (facets), Lucene uses a taxonomy writer that stores the categories and their hierarchy. To associate categories with a document, we use a CategoryDocumentBuilder. A category is described by a CategoryPath.
We can define an author name as a category:

  • new CategoryPath("author", "Jon Deuf")
  • or new CategoryPath("author/Jon Deuf", '/'). The last parameter defines the delimiter character.
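
Both forms describe the same path. As a rough, Lucene-free sketch of the idea (PathSketch and its parse method are hypothetical names, not part of the Lucene API, and skipping empty components is a choice made for this sketch), splitting a delimited string on the delimiter yields the same components as listing them explicitly:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class: it only mimics the idea of a
// category path given either as components or as one delimited string.
class PathSketch {
	// Splits a delimited path into its components, skipping empty ones.
	static List<String> parse(String path, char delimiter) {
		List<String> components = new ArrayList<>();
		StringBuilder current = new StringBuilder();
		for (char c : path.toCharArray()) {
			if (c == delimiter) {
				if (current.length() > 0) {
					components.add(current.toString());
					current.setLength(0);
				}
			} else {
				current.append(c);
			}
		}
		if (current.length() > 0) {
			components.add(current.toString());
		}
		return components;
	}

	public static void main(String[] args) {
		List<String> explicit = Arrays.asList("author", "Jon Deuf");
		List<String> parsed = parse("author/Jon Deuf", '/');
		System.out.println(parsed + " equals " + explicit + ": " + parsed.equals(explicit));
	}
}
```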

Several author ‘categories’ can be added for the same document.

For the book category that is actually a category path, we can define it as:

  • new CategoryPath("book_category", "novel", "comedy")
  • or new CategoryPath("book_category/novel/comedy", '/'). The last parameter defines the delimiter character.
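
Because such a path is hierarchical, a taxonomy has to know about the parent categories as well as the full path. The toy class below sketches that idea in plain Java by giving every ancestor of an added path its own integer id. ToyTaxonomy and addCategory are made-up names, and this is not how DirectoryTaxonomyWriter is implemented:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy taxonomy for illustration only; it does not mirror Lucene's
// on-disk taxonomy index.
class ToyTaxonomy {
	// Maps each known path (joined with '/') to a small integer id.
	private final Map<String, Integer> ordinals = new LinkedHashMap<>();

	// Registers the path and all of its ancestors; returns the id of the
	// full path (or -1 if no component is given).
	int addCategory(String... components) {
		StringBuilder path = new StringBuilder();
		int ordinal = -1;
		for (String component : components) {
			if (path.length() > 0) {
				path.append('/');
			}
			path.append(component);
			String key = path.toString();
			Integer existing = ordinals.get(key);
			if (existing == null) {
				existing = ordinals.size();
				ordinals.put(key, existing);
			}
			ordinal = existing;
		}
		return ordinal;
	}

	// All known paths, in the order they were first registered.
	List<String> paths() {
		return new ArrayList<>(ordinals.keySet());
	}

	public static void main(String[] args) {
		ToyTaxonomy taxonomy = new ToyTaxonomy();
		taxonomy.addCategory("book_category", "novel", "comedy");
		taxonomy.addCategory("book_category", "novel", "drama");
		// Ancestors are stored once and shared by both leaves.
		System.out.println(taxonomy.paths());
	}
}
```

Storing the ancestors is what later makes it possible to count or drill down at the novel level without listing every subcategory.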

Once we have defined all the category paths, we associate them to the document:

categoryDocumentBuilder.setCategoryPaths(categoryPaths);
categoryDocumentBuilder.build(document);

Behind the scenes, this adds extra fields to the document so that it can be used for faceted search.
For this reason, we create a new Document instance for each book that we index. Do not reuse the Document instance, as some Lucene optimization guides suggest: the following documents would accumulate the categories of the previous ones.
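
The pitfall is the usual one with a mutable object that is appended to on every iteration. The following plain-Java analogy (ToyDocument is a made-up stand-in, not Lucene's Document) shows how a reused instance keeps the category fields of earlier books, while a fresh instance per book does not:

```java
import java.util.ArrayList;
import java.util.List;

class ReuseSketch {
	// One shared document for all books: returns the field count after each book.
	static List<Integer> reuseOneDocument(String[] categories) {
		List<Integer> sizes = new ArrayList<>();
		ToyDocument shared = new ToyDocument();
		for (String category : categories) {
			shared.add(category); // categories of earlier books are still there
			sizes.add(shared.fields.size());
		}
		return sizes;
	}

	// A new document per book: each one holds only its own category.
	static List<Integer> freshDocumentPerBook(String[] categories) {
		List<Integer> sizes = new ArrayList<>();
		for (String category : categories) {
			ToyDocument fresh = new ToyDocument();
			fresh.add(category);
			sizes.add(fresh.fields.size());
		}
		return sizes;
	}

	public static void main(String[] args) {
		String[] categories = {"novel/comedy", "novel/drama", "book/comedy"};
		System.out.println("reused: " + reuseOneDocument(categories));
		System.out.println("fresh:  " + freshDocumentPerBook(categories));
	}
}

// Made-up stand-in for a document: just a growing list of field values.
class ToyDocument {
	final List<String> fields = new ArrayList<>();

	void add(String field) {
		fields.add(field);
	}
}
```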

If you have checked out the project from GitHub, you need to compile it first:

mvn clean compile assembly:single

Then you can index the file books.json:

$ java -cp target/facet-lucene-example-1.0-jar-with-dependencies.jar com.chimpler.example.FacetLuceneIndexer index taxonomy books.json

Book: id=1, title=a funny book, book_category=/novel/comedy, authors=["Jean Bon","Alex Terieur"]
Book: id=2, title=a dramatic book, book_category=/novel/drama, authors=["Alex Terieur"]
Book: id=3, title=A hilarious book, book_category=/book/comedy, authors=["Marc Assin","Harry Covert"]
Book: id=4, title=A sad story, book_category=/novel/drama, authors=["Gerard Menvusa","Alex Terieur"]
Book: id=5, title=A very sad story, book_category=/novel/tragedy, authors=["Gerard Menvusa","Alain Terieur"]

The first argument is the path to the index directory, the second is the path to the taxonomy directory, and the last is the path to the JSON file of books to index.

Searcher

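// Imports assume Lucene 4.0 on the classpath
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.index.params.DefaultFacetIndexingParams;
import org.apache.lucene.facet.search.FacetsCollector;
import org.apache.lucene.facet.search.params.CountFacetRequest;
import org.apache.lucene.facet.search.params.FacetSearchParams;
import org.apache.lucene.facet.search.results.FacetResult;
import org.apache.lucene.facet.search.results.FacetResultNode;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class FacetLuceneSearcher {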
	private static Version LUCENE_VERSION = Version.LUCENE_40;
	public static void main(String args[]) throws Exception {
		if (args.length != 3) {
			System.err.println("Parameters: [index directory] [taxonomy directory] [query]");
			System.exit(1);
		}

		String indexDirectory = args[0];
		String taxonomyDirectory = args[1];
		String query = args[2];

		IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(indexDirectory)));
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);

		TaxonomyReader taxonomyReader = new DirectoryTaxonomyReader(FSDirectory.open(new File(taxonomyDirectory)));

		FacetSearchParams searchParams = new FacetSearchParams(new DefaultFacetIndexingParams());
		searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 100));
		searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("book_category"), 100));

		ComplexPhraseQueryParser queryParser = new ComplexPhraseQueryParser(LUCENE_VERSION, "title", new StandardAnalyzer(LUCENE_VERSION));
		Query luceneQuery = queryParser.parse(query);

		// Collectors to get top results and facets
		TopScoreDocCollector topScoreDocCollector = TopScoreDocCollector.create(10, true);
		FacetsCollector facetsCollector = new FacetsCollector(searchParams, indexReader, taxonomyReader);
		indexSearcher.search(luceneQuery, MultiCollector.wrap(topScoreDocCollector, facetsCollector));
		System.out.println("Found:");

		for(ScoreDoc scoreDoc: topScoreDocCollector.topDocs().scoreDocs) {
			Document document = indexReader.document(scoreDoc.doc);
			System.out.printf("- book: id=%s, title=%s, book_category=%s, authors=%s, score=%f\n",
					document.get("id"), document.get("title"),
					document.get("book_category"),
					document.get("authors"),
					scoreDoc.score);
		}

		System.out.println("Facets:");
		for(FacetResult facetResult: facetsCollector.getFacetResults()) {
			System.out.println("- " + facetResult.getFacetResultNode().getLabel());
			for(FacetResultNode facetResultNode: facetResult.getFacetResultNode().getSubResults()) {
				System.out.printf("    - %s (%f)\n", facetResultNode.getLabel().toString(),
						facetResultNode.getValue());
				for(FacetResultNode subFacetResultNode: facetResultNode.getSubResults()) {
					System.out.printf("        - %s (%f)\n", subFacetResultNode.getLabel().toString(),
							subFacetResultNode.getValue());
				}
			}
		}
		taxonomyReader.close();
		indexReader.close();
	}
}

As in the indexer, where we used a TaxonomyWriter, we use a TaxonomyReader to read the categories.
Before running the search, we define a FacetsCollector that counts, for a set of categories, the number of matching documents:

FacetSearchParams searchParams = new FacetSearchParams(new DefaultFacetIndexingParams());
searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 100));
searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("book_category"), 100));
FacetsCollector facetsCollector = new FacetsCollector(searchParams, indexReader, taxonomyReader);

The second parameter of CountFacetRequest (100) is the number of categories that we want to keep. Here we get only the 100 authors with the most documents matching the query, and likewise for the book categories.
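
Conceptually, a count request tallies the matching documents per category and keeps the largest counts. Here is a minimal sketch of that idea in plain Java, using the authors of the two books that match the query story in the sample data. The class and method names are hypothetical, the alphabetical tie-break is a choice made for this sketch, and Lucene itself counts taxonomy ordinals rather than strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TopFacetsSketch {
	// Counts how many matching documents carry each label and returns the
	// topN entries, highest count first (ties broken alphabetically).
	static List<Map.Entry<String, Integer>> topCounts(List<List<String>> labelsPerDoc, int topN) {
		Map<String, Integer> counts = new HashMap<>();
		for (List<String> labels : labelsPerDoc) {
			// A set ensures a document counts once per label.
			for (String label : new HashSet<>(labels)) {
				counts.merge(label, 1, Integer::sum);
			}
		}
		List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
		entries.sort((a, b) -> {
			int byCount = b.getValue().compareTo(a.getValue());
			return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
		});
		return entries.subList(0, Math.min(topN, entries.size()));
	}

	public static void main(String[] args) {
		// Authors of books 4 and 5 from the sample data.
		List<List<String>> matchingDocs = Arrays.asList(
				Arrays.asList("Gerard Menvusa", "Alex Terieur"),
				Arrays.asList("Gerard Menvusa", "Alain Terieur"));
		for (Map.Entry<String, Integer> entry : topCounts(matchingDocs, 100)) {
			System.out.println("author/" + entry.getKey() + " (" + entry.getValue() + ")");
		}
	}
}
```

With a limit of 100 nothing is cut off for this tiny data set; a smaller limit would simply truncate the sorted list.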

To run the search and collect both the results and the facets in a single pass, we use a MultiCollector. After the search, the facet information is available from the FacetsCollector.
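
The idea behind a multi-collector is a simple fan-out: each matching document id is handed to every wrapped collector, so the matches are traversed only once. A hedged sketch with made-up types (ToyCollector and fanOut are not Lucene names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FanOutSketch {
	// Wraps several collectors into one that forwards every hit to all of them.
	static ToyCollector fanOut(ToyCollector... collectors) {
		return docId -> {
			for (ToyCollector collector : collectors) {
				collector.collect(docId);
			}
		};
	}

	public static void main(String[] args) {
		List<Integer> hits = new ArrayList<>(); // plays the role of the top-docs collector
		int[] hitCount = new int[1];            // plays the role of the facet counter
		ToyCollector both = fanOut(docId -> hits.add(docId), docId -> hitCount[0]++);
		for (int docId : Arrays.asList(4, 5)) { // a single pass over the matches
			both.collect(docId);
		}
		System.out.println(hits + " " + hitCount[0]);
	}
}

// Made-up interface standing in for a collector: it is told about each hit.
interface ToyCollector {
	void collect(int docId);
}
```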

To run the searcher:

$ java -cp target/facet-lucene-example-1.0-jar-with-dependencies.jar com.chimpler.example.FacetLuceneSearcher index taxonomy story 
Found:
- book: id=4, title=A sad story, book_category=/novel/drama, authors=Gerard Menvusa, Alex Terieur, score=0.755413
- book: id=5, title=A very sad story, book_category=/novel/tragedy, authors=Gerard Menvusa, Alain Terieur, score=0.755413
Facets:
- author
    - author/Gerard Menvusa (2.000000)
    - author/Alain Terieur (1.000000)
    - author/Alex Terieur (1.000000)
- book_category
    - book_category/novel (2.000000)

As with the indexer, the first and second arguments are the directories where the index and the taxonomy are stored. The third argument is the query.

Advanced Searcher

To narrow down the search result to a category, we can use the DrillDown class:

Query newLuceneQuery = DrillDown.query(luceneQuery, new CategoryPath("book_category/novel", '/'));

It returns a new query that limits the scope to novels.
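
In effect, drilling down adds a required condition to the original query: a document must match the text query and belong to the chosen category or one of its descendants. A rough plain-Java sketch of that filtering step, using the category paths of the sample books whose title contains book (DrillDownSketch and drillDown are hypothetical names, and this is not how Lucene evaluates the query internally):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DrillDownSketch {
	// Keeps the ids whose category path equals the drill-down path or lies below it.
	static List<Integer> drillDown(Map<Integer, String> categoryByDocId, String drillDownPath) {
		List<Integer> kept = new ArrayList<>();
		for (Map.Entry<Integer, String> entry : categoryByDocId.entrySet()) {
			String path = entry.getValue();
			if (path.equals(drillDownPath) || path.startsWith(drillDownPath + "/")) {
				kept.add(entry.getKey());
			}
		}
		return kept;
	}

	public static void main(String[] args) {
		// Books 1, 2 and 3 from the sample data, keyed by id.
		Map<Integer, String> matches = new LinkedHashMap<>();
		matches.put(1, "book_category/novel/comedy");
		matches.put(2, "book_category/novel/drama");
		matches.put(3, "book_category/book/comedy");
		System.out.println(drillDown(matches, "book_category/novel")); // prints [1, 2]
	}
}
```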

As we narrow the scope to novels, we might want some information on the subcategories of novels (drama, tragedy). To get it, we just need to add a new CountFacetRequest:

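searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("book_category/novel", '/'), 100));

Here is the complete advanced searcher:

// Imports assume Lucene 4.0 on the classpath
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.index.params.DefaultFacetIndexingParams;
import org.apache.lucene.facet.search.DrillDown;
import org.apache.lucene.facet.search.FacetsCollector;
import org.apache.lucene.facet.search.params.CountFacetRequest;
import org.apache.lucene.facet.search.params.FacetSearchParams;
import org.apache.lucene.facet.search.results.FacetResult;
import org.apache.lucene.facet.search.results.FacetResultNode;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class FacetLuceneAdvancedSearcher {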
	private static Version LUCENE_VERSION = Version.LUCENE_40;
	public static void main(String args[]) throws Exception {
		if (args.length != 5) {
			System.err.println("Parameters: [index directory] [taxonomy directory] [query] [field drilldown] [value drilldown]");
			System.exit(1);
		}

		String indexDirectory = args[0];
		String taxonomyDirectory = args[1];
		String query = args[2];
		String fieldDrilldown = args[3];
		String valueDrilldown = args[4];

		IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(indexDirectory)));
		IndexSearcher indexSearcher = new IndexSearcher(indexReader);

		TaxonomyReader taxonomyReader = new DirectoryTaxonomyReader(FSDirectory.open(new File(taxonomyDirectory)));

		CategoryPath drillDownCategoryPath = new CategoryPath(fieldDrilldown + "/" + valueDrilldown, '/');

		FacetSearchParams searchParams = new FacetSearchParams(new DefaultFacetIndexingParams());
		searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 100));
		searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("book_category"), 100));
		searchParams.addFacetRequest(new CountFacetRequest(drillDownCategoryPath, 100));

		ComplexPhraseQueryParser queryParser = new ComplexPhraseQueryParser(LUCENE_VERSION, "title", new StandardAnalyzer(LUCENE_VERSION));

		Query luceneQuery = queryParser.parse(query);
		luceneQuery = DrillDown.query(luceneQuery, drillDownCategoryPath);

		// Collectors to get top results and facets
		TopScoreDocCollector topScoreDocCollector = TopScoreDocCollector.create(10, true);
		FacetsCollector facetsCollector = new FacetsCollector(searchParams, indexReader, taxonomyReader);
		indexSearcher.search(luceneQuery, MultiCollector.wrap(topScoreDocCollector, facetsCollector));
		System.out.println("Found:");

		for(ScoreDoc scoreDoc: topScoreDocCollector.topDocs().scoreDocs) {
			Document document = indexReader.document(scoreDoc.doc);
			System.out.printf("- book: id=%s, title=%s, book_category=%s, authors=%s, score=%f\n",
					document.get("id"), document.get("title"),
					document.get("book_category"),
					document.get("authors"),
					scoreDoc.score);
		}

		System.out.println("Facets:");
		for(FacetResult facetResult: facetsCollector.getFacetResults()) {
			System.out.println("- " + facetResult.getFacetResultNode().getLabel());
			for(FacetResultNode facetResultNode: facetResult.getFacetResultNode().getSubResults()) {
				System.out.printf("    - %s (%f)\n", facetResultNode.getLabel().toString(),
						facetResultNode.getValue());
				for(FacetResultNode subFacetResultNode: facetResultNode.getSubResults()) {
					System.out.printf("        - %s (%f)\n", subFacetResultNode.getLabel().toString(),
							subFacetResultNode.getValue());
				}
			}
		}
		taxonomyReader.close();
		indexReader.close();
	}
}

To run the advanced searcher:

$ java -cp target/facet-lucene-example-1.0-jar-with-dependencies.jar com.chimpler.example.FacetLuceneAdvancedSearcher index taxonomy book book_category novel
Found:
- book: id=1, title=a funny book, book_category=/novel/comedy, authors=Jean Bon, Alex Terieur, score=1.106425
- book: id=2, title=a dramatic book, book_category=/novel/drama, authors=Alex Terieur, score=1.106425
Facets:
- author
    - author/Alex Terieur (2.000000)
    - author/Jean Bon (1.000000)
- book_category
    - book_category/novel (2.000000)
- book_category/novel
    - book_category/novel/drama (1.000000)
    - book_category/novel/comedy (1.000000)

Compared to the searcher, there are two additional arguments: the fourth is the category to drill down into, and the last is the value within that category.

You can narrow the scope further, down to the novel/comedy level, by running:

$ java -cp target/facet-lucene-example-1.0-jar-with-dependencies.jar com.chimpler.example.FacetLuceneAdvancedSearcher index taxonomy book book_category novel/comedy
Found:
- book: id=1, title=a funny book, book_category=/novel/comedy, authors=Jean Bon, Alex Terieur, score=1.944335
Facets:
- author
    - author/Alex Terieur (1.000000)
    - author/Jean Bon (1.000000)
- book_category
    - book_category/novel (1.000000)
- book_category/novel/comedy

About chimpler
http://www.chimpler.com

4 Responses to Faceted Search with Lucene 4

  1. GG says:

    I am trying this code with Lucene version 4.4. It's not giving results for facets. I have changed

    FacetSearchParams searchParams = new FacetSearchParams(new DefaultFacetIndexingParams());
    searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 100));
    searchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("book_category"), 100));

    to

    List ls=new ArrayList();
    ls.add(new CountFacetRequest(new CategoryPath("author"),100));
    ls.add(new CountFacetRequest(new CategoryPath("book_category"),100));
    FacetSearchParams searchParams=new FacetSearchParams(new FacetIndexingParams(),ls);

  2. GG says:

    Instead of number of documents, can we change the parameter to what we desire??

  3. Jigar Shah says:

    Thanks mate its helpful

  4. k2013joseph says:

    this is really helpful, let me look into it in details 🙂
