Painless guide to Solr Cloud configuration

Posted on Apr 6, 2013 in Cloud Computing, Environment, Linux

“Cloud” has become a very ambiguous term and it can mean virtually anything these days. If you are not familiar with Solr Cloud, think of it as one logical service hosted on multiple servers. A distributed architecture helps with scaling, fault tolerance and distributed indexing, and generally speaking improves search capabilities.

All of that is very exciting and I’m highly impressed by how the service is designed, but… it’s a relatively new product. If you play with the tutorial (which is great, by the way), running multiple services on the same host doesn’t cause problems. Setting the service up for a production environment and using it is a different story. There are still some unresolved issues which can leave you confused for hours, if not days.

I would like to share my experience of setting up Solr Cloud and highlight the problems I came across. If you are completely new to the subject, I recommend reading the Solr Cloud tutorial first. If you are new to Solr, have a look at my previous post.

All the applications we are going to use in this post are written in Java. I’m doing my best to set the service up to the highest standards, but I’m not a Java developer. Some things could probably be done better. If that’s the case, I would love to hear your feedback.

Goals and assumptions for this tutorial are:

  • Solr 4.2.1 – this is the latest version of Solr at the time of writing this post.
  • Tomcat 7 – Solr comes with Jetty, which is very helpful for development, but our goal is a production setup. The service will have to be monitored and maintained by sysops. They are usually more familiar with Tomcat 7 than with other solutions.
  • External ZooKeeper – Solr has ZooKeeper built in, but using it in production is not recommended.
  • Operating system Ubuntu 12.04 – personal preference (sorry, RedHat folks).

If you would like to try this setup but you don’t have access to multiple servers, there are two options:

  • Use Amazon EC2 micro instances (free within the AWS free tier).
  • Use virtualisation – I recommend VMware Player. It’s free and fast.

A journey of a thousand miles begins with a single step. Let’s log in to the first server and download ZooKeeper, Solr and Tomcat 7.

My links might be out of date. Make sure you download the latest version of ZooKeeper and Solr.

Before we do any configuration, let’s check your host name.

Now look for it in /etc/hosts.

If you find your host name mapped to a loopback address such as 127.0.1.1, change the IP to your LAN IP address. This tiny thing gave me a guru meditation for a few days. It will make at least one of your Solr nodes register as “127.0.1.1”. A localhost address doesn’t make any sense from the cloud’s point of view, and it causes multiple issues with replication and leader election. Later on it is hard to guess that all of those problems come from this silly source. Don’t repeat my mistake.
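The check described above can also be automated. Below is a minimal Python sketch, not part of the original setup, that scans the content of a hosts file and reports loopback addresses assigned to a given host name. The host name solr01 and the sample file content are assumptions made up for illustration.

```python
import ipaddress

def loopback_mappings(hosts_text, hostname):
    """Return the loopback IPs that the hosts file content maps `hostname` to."""
    found = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into the IP and its host names.
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2:
            continue
        ip, names = entry[0], entry[1:]
        try:
            is_loopback = ipaddress.ip_address(ip).is_loopback
        except ValueError:
            continue  # first column is not a valid IP address, skip the line
        if is_loopback and hostname in names:
            found.append(ip)
    return found

# Hypothetical hosts file content; "solr01" is an assumed host name.
sample = """127.0.0.1 localhost
127.0.1.1 solr01
"""
print(loopback_mappings(sample, "solr01"))
```

On a real server you would pass the content of /etc/hosts and the value of socket.gethostname() instead of the sample. An empty list means the host name is not bound to a loopback address.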

Unpack the downloaded software.

The easier job is setting up ZooKeeper. You will do it only once, on the first server. ZooKeeper scales well and you can run it on all Solr nodes if required. There is no need for that at the moment, so we can take the single-server approach.

Create a directory for ZooKeeper data and point the configuration at it.

Find dataDir and set it to the appropriate path.
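If you prefer to script this edit, here is a minimal Python sketch of the same change. It assumes the configuration is a plain key=value file (like the zoo_sample.cfg shipped with ZooKeeper), and the target path /var/lib/zookeeper is only an example, not a path taken from this post.

```python
def set_data_dir(config_text, data_dir):
    """Return the configuration text with dataDir pointing at data_dir."""
    lines = []
    replaced = False
    for line in config_text.splitlines():
        # Compare only the key, so commented-out lines are left alone.
        key = line.split("=", 1)[0].strip()
        if key == "dataDir":
            lines.append("dataDir=" + data_dir)
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        lines.append("dataDir=" + data_dir)  # add the setting if it was missing
    return "\n".join(lines) + "\n"

# Sample content modelled on ZooKeeper's zoo_sample.cfg.
sample = "tickTime=2000\ndataDir=/tmp/zookeeper\nclientPort=2181\n"
print(set_data_dir(sample, "/var/lib/zookeeper"))
```

The function works on text only, so you decide separately which file to read and where to write the result.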

Start ZooKeeper.

If you like, you can use the ZooKeeper client to connect to the server.

Type “quit” to exit.

Now let’s load the Solr configuration into ZooKeeper. Go to the Solr directory and have a look into solr-webapp. It should be empty.

Please note that I’m using the example directory. In real life you obviously want to rename it to something better. The same goes for collections. I’m going to use the default collection1 for this tutorial.

If your solr-webapp doesn’t have solr.war inside, run Solr for a few seconds to make it extract the file.

After this line you can press ctrl+c to stop the server.

Now we can start uploading the configuration to ZooKeeper.

If you would like to learn more about the zkcli script, have a look here: http://docs.lucidworks.com/display/solr/Command+Line+Utilities.

Now if you log in to ZooKeeper and run the “ls /” command, you should see the uploaded data.

This step is obviously not required but it’s good to know what happens inside each service and how to get there.

If you are impatient, you can go to “solr-4.2.1/example/” and run the service.

It should work in Cloud mode, and if you are happy running it that way you can skip the Tomcat setup. If that’s the case, visit http://SERVER01_IP:8080/solr/#/~cloud to confirm it’s working.

If you go to that URL, have a look at the first sub-item in the navigation. It’s called “Tree”. Does it look familiar? Yes, it’s ZooKeeper’s data.

The final step is to set up Tomcat. Stop Solr (Ctrl+C) if it is running and go to Tomcat’s directory.

Paste the configuration below. Make sure the docBase and Environment paths match your setup.

Enable an admin user for Tomcat.

Add a user entry with the username admin, the password secret and the manager-gui role inside the tomcat-users tag.

You are almost there. The last thing is to “tell” Solr to use ZooKeeper. We already know how to do it from the command line. When you run Solr from an external container you have to edit solr.xml.

Find the top-level tag called solr and add the zkHost attribute.

While you are editing solr.xml, go to the cores tag and set the hostPort attribute to 8080.
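Both solr.xml changes can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample XML is a trimmed-down guess at what a Solr 4.x solr.xml looks like, and the ZooKeeper address 192.168.1.10:2181 is a made-up value that you would replace with your own.

```python
import xml.etree.ElementTree as ET

def configure_solr_xml(xml_text, zk_host, host_port):
    """Set zkHost on the <solr> tag and hostPort on the <cores> tag."""
    root = ET.fromstring(xml_text)
    if root.tag != "solr":
        raise ValueError("expected <solr> as the top-level tag")
    root.set("zkHost", zk_host)
    cores = root.find("cores")
    if cores is None:
        raise ValueError("no <cores> tag found")
    cores.set("hostPort", str(host_port))
    return ET.tostring(root, encoding="unicode")

# Trimmed-down sample; a real solr.xml carries more attributes.
sample = """<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1" hostPort="8983">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>"""
print(configure_solr_xml(sample, "192.168.1.10:2181", 8080))
```

Note that ElementTree drops XML comments and the XML declaration when it parses the text, so editing the file by hand is the safer choice if you want to keep them.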

Restart Tomcat.

Open a web browser and go to http://SERVER01_IP:8080/manager/html. You will be asked for the username and password you set in the previous step (admin/secret).

Find “/solr” in the Applications list and click on “start” in the commands column. If it fails with the message “FAIL – Application at context path /solr could not be started”, it’s most likely a permissions issue. You can resolve it with

If it still doesn’t work you can troubleshoot it in “/var/log/tomcat7/catalina.*.log”.

Once the service is running you can access it under http://SERVER01_IP:8080/solr/#/.

That was the first server. To have a cloud you need at least one more. The steps are exactly the same, except that you can skip everything related to ZooKeeper. Make sure to set the correct IP address for zkHost in solr.xml.

Run the second server and go to http://SERVER01_IP:8080/solr/#/~cloud. You should see two servers replicating collection1.

[Screenshot solr01: cloud view with two servers replicating collection1]

Just a reminder: if one of your servers shows a local IP like 127.0.1.1, there is a problem with your /etc/hosts file. If you made a mistake you can always start again. Stop the Tomcat servers, log in to ZooKeeper and remove “clusterstate.json”.

Now you can insert some data into your index.

The Bash script needs to be updated because it points to Solr’s default port (8983) rather than 8080.

Run the script.
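If you would rather not edit the Bash script, a document can also be sent to the update handler directly. The Python sketch below is an alternative to the script, not the method used in this post. It only builds the HTTP request; the document id demo-1 and the helper name are made up for illustration.

```python
import urllib.request

def build_update_request(base_url, xml_payload):
    """Build a POST request for Solr's XML update handler with an immediate commit."""
    url = base_url.rstrip("/") + "/update?commit=true"
    return urllib.request.Request(
        url,
        data=xml_payload.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )

# A made-up document; the fields must exist in your schema.
doc = '<add><doc><field name="id">demo-1</field></doc></add>'
request = build_update_request("http://SERVER01_IP:8080/solr", doc)
print(request.get_method(), request.full_url)
```

To actually send the document, replace SERVER01_IP with a real address and pass the request to urllib.request.urlopen.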

So far so good. Now let’s use the Collections API to create a new collection.

http://ONE_OF_YOUR_SERVERS:8080/solr/admin/collections?action=CREATE&name=hello&numShards=2&replicationFactor=1&collection.configName=default1

This should add a new collection called “hello”. The collection will use the previously uploaded configuration “default1” and will be split across both servers.
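The CREATE call is easier to reuse when the URL is assembled from its parameters. Here is a minimal Python sketch that rebuilds the URL shown above; the function name is made up, and the parameter names are the ones already used in the URL.

```python
from urllib.parse import urlencode

def create_collection_url(server, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE URL for the given server (host:port)."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    })
    return "http://{}/solr/admin/collections?{}".format(server, params)

print(create_collection_url("ONE_OF_YOUR_SERVERS:8080", "hello", 2, 1, "default1"))
```

Opening the printed URL in a browser, with a real server address in place of the placeholder, has the same effect as typing it by hand.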

[Screenshot solr02: cloud view with the “hello” collection split across both servers]

That looks more interesting. If you click on the core selector (at the bottom of the left-hand navigation) you will notice the core is called “hello_shard1_replica1”. On the other server the name will be “hello_shard2_replica1”. You can still use the “hello” name to query either server, for example:

http://ONE_OF_YOUR_SERVERS:8080/solr/hello/select?q=*%3A*&wt=xml&indent=true
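The same query can be built and checked from a script. The Python sketch below is an illustration only: it asks for JSON (wt=json) instead of the XML used in the URL above, because JSON is easier to parse, and the sample response is a hand-written stand-in for what Solr returns.

```python
import json
from urllib.parse import urlencode

def select_url(server, collection, query="*:*"):
    """Build a select URL that queries a collection by its logical name."""
    params = urlencode({"q": query, "wt": "json", "indent": "true"})
    return "http://{}/solr/{}/select?{}".format(server, collection, params)

def num_found(response_text):
    """Extract the number of matching documents from a JSON select response."""
    return json.loads(response_text)["response"]["numFound"]

print(select_url("ONE_OF_YOUR_SERVERS:8080", "hello"))

# Hand-written stand-in for a Solr JSON response.
sample_response = '{"responseHeader": {"status": 0}, "response": {"numFound": 3, "start": 0, "docs": []}}'
print(num_found(sample_response))
```

Running the query against each server in turn and comparing the counts is a quick way to confirm that the logical name resolves on every node.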

If you are not on Solr 4.3 yet, you have to be aware of a very confusing bug: SOLR-4584. On some occasions you might not wish to store a particular index on every server. For example, your cloud consists of 3 servers and you set the number of shards to 1 and the replication factor to 2. If you query a server which doesn’t physically store the data, you will get an error. This is obviously undesired behavior and will get fixed. Right now you have to live with it, so my recommendation is to use all servers.
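A quick way to see whether a planned collection is exposed to this problem is to count the cores it will create. The Python sketch below is a simplification: it assumes every core lands on a different server, and the function name is made up.

```python
def nodes_without_replica(node_count, num_shards, replication_factor):
    """Count servers left without any core of the collection.

    Assumes each core is placed on a different server.
    """
    cores = num_shards * replication_factor
    return max(0, node_count - cores)

# The example from the text: 3 servers, 1 shard, replication factor 2.
print(nodes_without_replica(3, 1, 2))
# The "hello" collection from this tutorial: 2 servers, 2 shards, replication factor 1.
print(nodes_without_replica(2, 2, 1))
```

Any result above zero means at least one server holds no data for the collection, so queries sent to that server can hit the bug.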

It takes some effort to set everything up, but it’s definitely worth it. There are some problems around Solr, and the Cloud setup could be easier, but I’m convinced all of those issues will eventually be addressed. If you still have some capacity for more Solr knowledge, watch this talk: Solr 4: The SolrCloud Architecture.

3 Comments

  1. Saqib Ali
    11/07/2013

    Hello,

    I am new to SolrCloud and Zookeeper. Can you please explain why I need to upload the configuration files to Zookeeper using zkCli.sh?

    Thanks.

    • Lukasz Kujawa
      16/07/2013

      Hi, there is no requirement to use zkCli.sh per se. You can use any other method you find appropriate.

  2. bold
    29/09/2013

    thx for the 127.0.1.1 advice~~
