Java and the Lucene search library [6] form the basis for the search engine framework Apache Solr [1]. In the previous three articles, we set up Apache Solr on the soon-to-be-released Debian GNU/Linux 11 “Bullseye,” initiated a single data core, uploaded example data, and demonstrated how to query output data in different ways and post-process it [2,3]. In part 3 [4], you learned how to connect the relational database management system PostgreSQL [5] to Apache Solr and run searches against it.

The more documents you have to manage, the longer the response time on a single-core setup. A multi-core Solr cluster helps to substantially reduce this response time and increase the effectiveness of the setup. This article demonstrates how to do that and which traps to avoid.

Why and when to take clustering into account

To begin with, you need to understand what the term clustering stands for, why it is helpful to think about it, and especially when, how, and for whom. There is no super-effective, all-inclusive recipe, but several general criteria for the cluster setup that balance the load and help you keep the search engine's response time within a specific range. This helps to run the search engine cluster reliably.

Generally speaking, the term clustering refers to a grouping of components that are similar to each other. Regarding Apache Solr, this means that you break down a large number of documents into smaller subsets based on the criteria you choose. You assign each subset to a single Apache Solr instance.

Instead of keeping all the documents in a single database, you store them in different topic-related databases or split them by letter range, for example, based on the first letter of the author's last name. The first database covers the letters A to L and the second one M to Z. To find information about books by Ernest Hemingway, you have to look for them in the first database, as the letter H falls alphabetically between A and L.

This setup already reduces your search area by 50% and, assuming an equally distributed number of book entries, reduces the search time likewise. In Apache Solr, this concept is called a shard or slice, which describes a logical section of a single collection.
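In Apache Solr, such a split can be requested directly when a collection is created. A minimal sketch, assuming a running SolrCloud setup like the one built later in this article, and using the made-up collection name books:

$ bin/solr create_collection -c books -shards 2 -replicationFactor 1

The -shards option divides the collection into two logical sections, and every incoming document is routed to exactly one of them.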

Someone who has only 500 documents can still easily handle the search on a single core. In contrast, someone who has to manage a library of 100,000 documents needs a way to keep the response time within a certain limit. If searching takes too long, the provided service will not be used; instead, the users will complain that it takes way too long.

Note, too, that the idealized assumption that two cores immediately cut the search time by 50% and three cores by 66% does not hold. The improvement is non-linear: roughly a factor of 1.5 for two cores, dropping to about 1.2 per additional core for three to four cores in a cluster. This non-linear behaviour is described by Amdahl's Law [7]. The additional time comes from the overhead needed to run the individual cores, coordinate the search processes, and manage their results. In general, there is a remarkable improvement, but it is non-linear and holds only up to a certain point. In certain circumstances, even five or more parallel cores already form that boundary and have about the same response time as four cores, while requiring remarkably more resources in hardware, energy, and bandwidth.
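As a rough sketch of this effect, Amdahl's Law relates the achievable speedup S(n) on n cores to the fraction p of the work that can actually run in parallel. Assuming, purely for illustration, that p is 0.7:

S(n) = 1 / ((1 - p) + p/n)

S(2) ≈ 1.5    S(4) ≈ 2.1    S(8) ≈ 2.6

The remaining 30% (coordination, merging of partial results, and other overhead) caps the benefit of every additional core.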

Clustering in Apache Solr in more detail

So far, our Solr-based search engine consists of only a single node or core. The next level is to run more than one node or core in parallel to process more than one search request at a time.

A Solr cluster is a set of individual Solr nodes. A cluster can also contain many document collections. The architectural principle behind Solr is not master-slave; as a result, every Solr node is a master of its own.

The first step towards fault tolerance and higher availability is running more than one Solr instance as separate processes. For the coordination between the different operations, Apache ZooKeeper [8] comes into play. ZooKeeper describes itself as “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.”

Going one step further, Apache Solr includes the ability to set up an entire cluster of Solr servers called SolrCloud [9]. Using SolrCloud, you can benefit from distributed indexing and search capabilities designed to handle an even more significant number of indexed documents.

Run Apache Solr with more than a single core as a collection

As already described in part 1 of this article series [2], Apache Solr runs under the user solr. The project directory under /opt/solr-8.7.0 (adjust the version number according to the Apache Solr version you use) and the variable data directory under /var/solr must belong to the solr user. If not done yet, you can achieve this as the root user with the help of these two commands:

# chown -R solr:solr /var/solr

# chown -R solr:solr /opt/solr-8.7.0
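To verify that the ownership has been set correctly, you can list both directories afterwards; each entry should show solr as user and group:

# ls -ld /var/solr /opt/solr-8.7.0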

The next step is starting Apache Solr in cloud mode. As user solr, run the script in the following way:
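In a default Solr 8.7 installation, the interactive SolrCloud example is launched with the -e option of the bin/solr script:

$ bin/solr -e cloud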

With this command, you start an interactive session to set up an entire SolrCloud cluster with embedded ZooKeeper. First, specify how many nodes the Solr cluster should consist of. The range is between 1 and 4, and the default value is 2:

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your local workstation.


To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]

Next, the script bin/solr prompts you for the port to bind each of the Solr nodes to. For the first node, it suggests port #8983, and for the second node, port #7574, as follows:

Please enter the port for node1 [8983]

Please enter the port for node2 [7574]

You can choose any available port here. Please make sure beforehand that other network services are not already using the specified ports. However, at least for the example used here, it is recommended to keep the default values. After answering the question, the script bin/solr starts the individual nodes one by one. Internally, it executes the following commands:

$ bin/solr start -cloud -s example/cloud/node1/solr -p 8983

$ bin/solr start -cloud -s example/cloud/node2/solr -p 7574 -z localhost:9983

The figure below demonstrates this step for the first node. The output for the second node looks similar.

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/1-4.jpg]

Simultaneously, the first node will also start an embedded ZooKeeper server, which is bound to port #9983. In the example call above, the Solr home for the first node is the directory example/cloud/node1/solr, as indicated by the -s option. The figure below shows the corresponding status messages.

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/2-4.jpg]
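At this point, you can already verify from a second terminal that both nodes and the embedded ZooKeeper are up. A brief sketch using the bundled tooling; the ZooKeeper address localhost:9983 matches the embedded instance started above:

$ bin/solr status

$ bin/solr zk ls / -z localhost:9983

The first command lists the locally running Solr processes with their ports and cloud information; the second one lists the top-level znodes that SolrCloud keeps in ZooKeeper.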

Having started the two nodes in the cluster, the script asks you for one more piece of information: the name of the collection to create. The default value is gettingstarted, which we substitute with cars from part 2 of this article series [3]:

Please provide a name for your new collection: [gettingstarted] cars

This entry is similar to the following script call that allows you to create the document collection cars individually:

$ bin/solr create_collection -c cars

Finally, the script prompts you for the number of shards and the number of replicas per shard. For this case, we stick to the default values of 2 shards and 2 replicas per shard. This allows you to understand how a collection is distributed across multiple nodes in a SolrCloud cluster and how SolrCloud handles replication.
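For reference, the same layout can also be created non-interactively. A sketch of the corresponding call with the options spelled out (values as chosen above):

$ bin/solr create_collection -c cars -shards 2 -replicationFactor 2

With 2 shards and 2 replicas each, the collection consists of four cores in total, distributed across the two nodes.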

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/3-4.jpg]

Now the Solr cluster is up and running and ready to go. There are several changes in the Solr Administration panel, like additional menu entries for Cloud and Collections. The three figures below show the information that is available about the previously created cloud. The first image displays the node state and its current usage.

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/4-4.jpg]

The second image displays the organization of the cloud as a directed graph. Each active node is shown in green with its name, IP address, and port number as previously defined. You find this information under the menu entry Cloud, in the submenu Graph.

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/5-5.jpg]

The third image displays information about the cars collection as well as its shards and replicas. To see the details for the collection, click on the menu entry “cars” located to the right of the main menu and below the button “Add Collection.” The corresponding shard information becomes visible if you click on the bold text labeled “Shard: shard1” and “Shard: shard2”.

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/6-2.jpg]

Apache Solr also provides information on the command line. For this purpose, it offers the subcommand healthcheck. As an additional parameter, enter -c followed by the name of the collection. In our case, the command to run the check on the cars collection is as follows:

$ bin/solr healthcheck -c cars

The information is returned in JSON format, as shown below.

[Figure: https://kirelos.com/wp-content/uploads/2021/06/echo/7-1.jpg]
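If you prefer the HTTP interface, similar information is available from the Collections API. A minimal sketch, assuming the first node listens on the default port 8983:

$ curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=cars"

The response is a JSON document describing the shards and replicas of the collection and their current states.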

As explained in the Solr manual, the healthcheck command collects basic information about each replica in a collection. This covers the number of documents, its current status (active or down), and the address where the replica is located in the SolrCloud. Finally, you can now add documents to SolrCloud. The call below adds the XML files stored in the directory datasets/cars to the cluster:

$ bin/post -c cars datasets/cars/*.xml

The uploaded data is distributed to the different cores and ready to be queried from there. See the previous articles on how to do that.
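As a quick verification, a simple query against the collection reports how many documents ended up in the index, and once you are done experimenting, the whole example cluster can be shut down again. A short sketch, assuming the default port 8983:

$ curl "http://localhost:8983/solr/cars/select?q=*:*&rows=0"

$ bin/solr stop -all

The numFound field in the query response should match the number of uploaded documents, and bin/solr stop -all stops all Solr nodes running on the machine.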

Conclusion

Apache Solr is designed to handle a large number of documents. To minimize the response time, run Solr as a cluster, as explained above. It takes a few steps, but we think it is worth the effort for having happier users of your document storage.

About the authors

Jacqui Kabeta is an environmentalist, avid researcher, trainer, and mentor. She has worked in the IT industry and in NGO environments in several African countries.

Frank Hofmann is an IT developer, trainer, and author who prefers to work from Berlin, Geneva, and Cape Town. He is a co-author of the Debian Package Management Book, available from dpmb.org.

Thank you

The authors would like to thank Saif du Plessis for his help while preparing the article.

Links and References