Painless guide to Solr Cloud configuration

“Cloud” has become a very ambiguous term and it can mean virtually anything these days. If you are not familiar with Solr Cloud, think of it as one logical service hosted on multiple servers. A distributed architecture helps with scaling, fault tolerance and distributed indexing, and generally improves search capabilities.

All of that is very exciting and I’m highly impressed by how the service is designed but… it’s a relatively new product. If you play with the tutorial (which by the way is great), running multiple services on the same host doesn’t cause problems. Setting the service up for a production environment and using it is a different story. There are still some unresolved issues which can confuse you for hours, if not days.

I would like to share my experience of setting up Solr Cloud and highlight the problems I came across. If you are completely new to the subject, I recommend reading the Solr Cloud tutorial first. If you are new to Solr, have a look at my previous post.

All the applications we are going to use in this post are written in Java. I’m doing my best to set the service up to the highest standards but I’m not a Java developer. There might be some things which could be done better. If that’s the case, I would love to hear your feedback.

Goals and assumptions for this tutorial are:

  • Solr 4.2.1 – this is the latest version of Solr at the time of writing.
  • Tomcat7 – Solr comes with Jetty, which is very helpful for development, but our goal is a production setup. The service will have to be monitored and maintained by sysops, who are usually more familiar with Tomcat7 than with other solutions.
  • External ZooKeeper – Solr has ZooKeeper built in but it’s not recommended to use the embedded one in production.
  • Operating system: Ubuntu 12.04 – personal preference (sorry RedHat folks).

If you would like to try this setup but you don’t have access to multiple servers, there are two options:

  • Use Amazon EC2 micro instances (free while you stay within the free usage tier).
  • Use virtualisation – I recommend VMware Player. It’s free and fast.

A journey of a thousand miles begins with a single step. Let’s log in to the first server and download ZooKeeper, Solr and Tomcat7.

$ sudo apt-get install tomcat7 tomcat7-admin
$ wget http://apache.mirrors.timporter.net/zookeeper/current/zookeeper-3.4.5.tar.gz
$ wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/lucene/solr/4.2.1/solr-4.2.1.tgz

My links might be out of date. Make sure you download the latest version of ZooKeeper and Solr.

Before we do any configuration, let’s check your host name.

$ hostname
ubuntu

Now look for it in /etc/hosts

$ sudo vim /etc/hosts

If you find something like this:

127.0.1.1       ubuntu

Change the IP to your LAN IP address. This tiny thing had me stuck for a few days. Left as it is, it will make at least one of your Solr nodes register as “127.0.1.1”. A loopback address doesn’t make any sense from the cloud’s point of view, because the other nodes cannot reach it, and it causes multiple issues with replication and leader election. It is hard to guess later that all of those problems come from this silly source. Don’t repeat my mistake.
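
A quick way to spot the problem is to list the loopback entries that mention your host name. The following is a minimal sketch; the function name loopback_entries is made up for this example, and it only reads the file you pass to it.

```shell
# Print any loopback entries (127.x.x.x) that mention the given host name.
# Usage: loopback_entries HOSTS_FILE HOST_NAME
loopback_entries() {
  awk -v host="$2" '
    /^[[:space:]]*#/ { next }            # skip comment lines
    $1 ~ /^127\./ {                      # only loopback addresses
      for (i = 2; i <= NF; i++)
        if ($i == host) { print; break }
    }
  ' "$1"
}

# Example (not run here): loopback_entries /etc/hosts "$(hostname)"
```

If the function prints anything for your host name, that line is the one to change to the LAN address.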

Unpack the downloaded software.

$ tar zxfv zookeeper-3.4.5.tar.gz
$ tar zxfv solr-4.2.1.tgz

The easier job is to set up ZooKeeper. You will do it only once, on the first server. ZooKeeper scales well and you can run it on all Solr nodes if required. There is no need for this at the moment, so we can take the single-server approach.

Create a directory for ZooKeeper data and point the configuration at it.

$ sudo mkdir -p /var/lib/zookeeper
$ cd zookeeper-3.4.5/
$ cp conf/zoo_sample.cfg conf/zoo.cfg
$ vim conf/zoo.cfg

Find dataDir and set it to the new path.

dataDir=/var/lib/zookeeper
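
The same edit can be scripted, which is handy when you prepare more than one server. This is a minimal sketch, assuming zoo.cfg already contains a dataDir line (the sample config does); set_zk_datadir is a hypothetical helper name.

```shell
# Rewrite the dataDir line of a ZooKeeper config file.
# Usage: set_zk_datadir CONFIG_FILE DATA_DIR
set_zk_datadir() {
  # Write to a temporary file first so the original is only replaced on success.
  sed "s|^dataDir=.*|dataDir=$2|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example (not run here): set_zk_datadir conf/zoo.cfg /var/lib/zookeeper
```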

Start ZooKeeper.

$ bin/zkServer.sh start
JMX enabled by default
Using config: /home/lukasz/zookeeper-3.4.5/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

If you like, you can use the ZooKeeper client to connect to the server.

$ bin/zkCli.sh -server 127.0.0.1:2181
[zk: 127.0.0.1:2181(CONNECTED) 0] ls /
[zookeeper]

Type “quit” to exit.
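
ZooKeeper also answers short text commands on its client port, so a health check can be scripted without the interactive client. This is a sketch, assuming the nc (netcat) utility is installed; zk_is_ok is a hypothetical helper name.

```shell
# Return success if the ZooKeeper server answers "imok" to the "ruok" probe.
# Usage: zk_is_ok HOST PORT   (needs the nc utility)
zk_is_ok() {
  [ "$(echo ruok | nc "$1" "$2")" = "imok" ]
}

# Example (not run here): zk_is_ok 127.0.0.1 2181 && echo "ZooKeeper is up"
```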

Now let’s insert the Solr configuration into ZooKeeper. Go to the Solr directory and have a look into solr-webapp. It should be empty.

$ cd solr-4.2.1/example/
$ ls solr-webapp/ 

Please note that I’m using the example directory. In real life you obviously want to rename it to something better. The same goes for collections. I’m going to use the default collection1 for this tutorial.

If solr-webapp is empty, run Solr for a few seconds so that Jetty extracts solr.war into it.

# java -jar start.jar
2013-04-05 09:38:58.132:INFO:oejs.Server:jetty-8.1.8.v20121106
2013-04-05 09:38:58.150:INFO:oejdp.ScanningAppProvider:Deployment monitor /root/solr-4.2.1/example/contexts at interval 0
2013-04-05 09:38:58.153:INFO:oejd.DeploymentManager:Deployable added: /root/solr-4.2.1/example/contexts/solr-jetty-context.xml
2013-04-05 09:38:58.209:INFO:oejw.WebInfConfiguration:Extract jar:file:/root/solr-4.2.1/example/webapps/solr.war!/ to /root/solr-4.2.1/example/solr-webapp/webapp

After this line appears you can press Ctrl+C to stop the server.

$ ls webapps/solr.war
webapps/solr.war

Now we can start uploading the configuration to ZooKeeper.

$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 127.0.0.1:2181 -d solr/collection1/conf/ -n default1
$ cloud-scripts/zkcli.sh -cmd linkconfig -zkhost 127.0.0.1:2181 -collection collection1 -confname default1 -solrhome solr
$ cloud-scripts/zkcli.sh -cmd bootstrap -zkhost 127.0.0.1:2181 -solrhome solr

If you would like to learn more about the zkcli script, have a look here: http://docs.lucidworks.com/display/solr/Command+Line+Utilities.

Now if you log in to ZooKeeper and run the “ls /” command, you should see the uploaded data.

$ bin/zkCli.sh -server 127.0.0.1:2181

[zk: localhost:2181(CONNECTED) 0] ls /
[configs, zookeeper, clusterstate.json, aliases.json, live_nodes, overseer, collections, overseer_elect]

[zk: 127.0.0.1:2181(CONNECTED) 1] ls /configs
[default1]

[zk: 127.0.0.1:2181(CONNECTED) 3] ls /configs/default1
[admin-extra.menu-top.html, currency.xml, protwords.txt, mapping-FoldToASCII.txt, solrconfig.xml, lang, stopwords.txt, spellings.txt, mapping-ISOLatin1Accent.txt, admin-extra.html, xslt, scripts.conf, synonyms.txt, update-script.js, velocity, elevate.xml, admin-extra.menu-bottom.html, schema.xml]

[zk: localhost:2181(CONNECTED) 4]  get /configs/default1/schema.xml 

// content of your schema.xml

[zk: 127.0.0.1:2181(CONNECTED) 5] quit
Quitting…

This step is obviously not required, but it’s good to know what happens inside each service and how to get there.

If you are impatient, you can go to “solr-4.2.1/example/” and run the service.

$ java -DzkHost=localhost:2181 -jar start.jar

It should work in cloud mode, and if you are happy with running it that way you can skip the Tomcat setup. If that’s the case, visit http://SERVER01_IP:8983/solr/#/~cloud (8983 is the default port of the bundled Jetty) to confirm it’s working.

If you go to that URL, have a look at the first sub-item in the navigation. It’s called “Tree”. Does it look familiar? Yes, it’s ZooKeeper’s data.

The final step is to set up Tomcat. Stop Solr (Ctrl+C) if it is running and go to Tomcat’s directory.

$ cd /etc/tomcat7/Catalina/localhost/
$ sudo vim solr.xml

Paste the configuration below. Make sure the docBase and Environment paths match your setup.
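
A Tomcat context file for Solr usually looks roughly like the one generated by the sketch below. The element and attribute names follow the commonly used Solr-on-Tomcat context fragment; the paths in the example are assumptions based on the home directory used elsewhere in this post, and write_solr_context is a hypothetical helper name.

```shell
# Write a Tomcat context file that points at the Solr WAR and the Solr home.
# Usage: write_solr_context OUT_FILE WAR_PATH SOLR_HOME
write_solr_context() {
  cat > "$1" <<EOF
<Context docBase="$2" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="$3" override="true"/>
</Context>
EOF
}

# Example with assumed paths (not run here, needs root):
# write_solr_context /etc/tomcat7/Catalina/localhost/solr.xml \
#   /home/lukasz/solr-4.2.1/example/webapps/solr.war \
#   /home/lukasz/solr-4.2.1/example/solr
```

docBase is the WAR file Tomcat deploys, and the solr/home Environment entry tells Solr where its solr.xml and collections live.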




     

Enable an admin user for Tomcat.

$ sudo vim /etc/tomcat7/tomcat-users.xml

Add an admin user entry (admin/secret in this post) inside the tomcat-users tag.
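
An entry along the following lines gives a user access to the manager web application. The admin/secret credentials are the ones used later in this post; the manager-gui role is an assumption based on the standard Tomcat 7 manager roles, and tomcat_manager_user is a hypothetical helper that only prints the lines to paste.

```shell
# Print a tomcat-users.xml entry for a user allowed into the manager web UI.
# Usage: tomcat_manager_user USERNAME PASSWORD
tomcat_manager_user() {
  printf '<role rolename="manager-gui"/>\n'
  printf '<user username="%s" password="%s" roles="manager-gui"/>\n' "$1" "$2"
}

tomcat_manager_user admin secret
```

Pick a stronger password than “secret” on anything reachable from outside your network.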

You are almost there. The last thing is to “tell” Solr to use ZooKeeper. We already know how to do it from the command line. When you run Solr in an external container, you have to edit solr.xml instead.

$ vim solr-4.2.1/example/solr/solr.xml

Find the top-level tag called solr and add a zkHost attribute pointing at your ZooKeeper server.

While you are editing solr.xml, go to the cores tag and set the hostPort attribute to 8080.
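
Both changes can also be applied with sed. This is only a sketch: it assumes the solr.xml layout of the Solr 4.x example, where the top-level solr tag already has at least one attribute and the cores tag already has a hostPort attribute. patch_solr_xml is a hypothetical helper name, and running it twice would add zkHost twice.

```shell
# Add a zkHost attribute to the <solr> tag and set hostPort on the <cores> tag.
# Usage: patch_solr_xml SOLR_XML ZK_HOST:PORT HTTP_PORT
patch_solr_xml() {
  sed -e "s|<solr |<solr zkHost=\"$2\" |" \
      -e "s|hostPort=\"[^\"]*\"|hostPort=\"$3\"|" \
      "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example (not run here):
# patch_solr_xml solr-4.2.1/example/solr/solr.xml SERVER01_IP:2181 8080
```

After the change the top-level tag should carry zkHost="SERVER01_IP:2181" and the cores tag hostPort="8080".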


Restart Tomcat.

$ sudo /etc/init.d/tomcat7 restart

Open a web browser and go to http://SERVER01_IP:8080/manager/html. You will be asked for the username and password which you set in the previous step (admin/secret).

Find “/solr” in the Applications list and click “Start” in the Commands column. If it fails with the message “FAIL – Application at context path /solr could not be started”, it’s most likely a permissions issue. You can resolve it with

$ sudo chown -R tomcat7:tomcat7 /home/lukasz/solr-4.2.1/

If it still doesn’t work, you can troubleshoot it using the logs in “/var/log/tomcat7/catalina.*.log”.

Once the service is running you can access it under http://SERVER01_IP:8080/solr/#/.

That was the first server. To have a cloud you need at least one more. The steps are exactly the same, except that you can skip everything related to ZooKeeper. Make sure to set the correct IP address (that of the first server) for zkHost in solr.xml.

Start the second server and go to http://SERVER01_IP:8080/solr/#/~cloud. You should see two servers replicating collection1.

[Screenshot solr01: Cloud view of the Solr admin UI]

Just to remind you: if one of your servers shows up with a local IP like 127.0.1.1, there is a problem with your /etc/hosts file. If you made a mistake you can always start again. Stop the Tomcat servers, log in to ZooKeeper and remove “clusterstate.json”.

[zk: localhost:2181(CONNECTED) 1] rmr /clusterstate.json

Now you can insert some data into your index.

$ cd solr-4.2.1/example/exampledocs/
$ vim post.sh

The script needs to be updated because it points to the default port (8983) instead of Tomcat’s 8080.

URL=http://localhost:8080/solr/update

Run the script.

$ ./post.sh mem.xml
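Posting file mem.xml to http://localhost:8080/solr/update
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">75</int></lst>
</response>

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">64</int></lst>
</response>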

So far so good. Now let’s use the Collections API to create a new collection.

http://ONE_OF_YOUR_SERVERS:8080/solr/admin/collections?action=CREATE&name=hello&numShards=2&replicationFactor=1&collection.configName=default1
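
The same request can be sent from the command line. Below is a small sketch that assembles the URL from its parts; create_collection_url is a hypothetical helper, and the parameters are the ones used in the URL above.

```shell
# Build the Collections API URL that creates a collection.
# Usage: create_collection_url HOST:PORT NAME SHARDS REPLICATION_FACTOR CONFIG_NAME
create_collection_url() {
  printf 'http://%s/solr/admin/collections?action=CREATE&name=%s&numShards=%s&replicationFactor=%s&collection.configName=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example (not run here):
# curl "$(create_collection_url ONE_OF_YOUR_SERVERS:8080 hello 2 1 default1)"
```

Keep the URL quoted when you pass it to curl, otherwise the shell treats each & as a request to run the command in the background.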

This should add a new collection called “hello”. The collection will use the previously uploaded configuration “default1” and will be split across both servers.

[Screenshot solr02: Cloud view after creating the “hello” collection]

That looks more interesting. If you click on the core selector (at the bottom of the left-hand navigation) you will notice the core is called “hello_shard1_replica1”. On the other server the name will be “hello_shard2_replica1”. You can still use the “hello” name to query any of the servers, for example:

http://ONE_OF_YOUR_SERVERS:8080/solr/hello/select?q=*%3A*&wt=xml&indent=true

If you are not on Solr 4.3 yet, you have to be aware of a very confusing bug – SOLR-4584. On some occasions you might not wish to store a particular index on every server. For example, your cloud consists of 3 servers and you set the number of shards to 1 and the replication factor to 2. If you send a query to the server which doesn’t physically store the data, you will get an error. This is obviously undesired behavior and will get fixed. Right now you have to live with it, so my recommendation is to use all servers.
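
The arithmetic behind that example can be checked before you create a collection. This is a sketch under the simplifying assumption that every core lands on a different server; servers_without_core is a hypothetical helper name.

```shell
# Number of servers that end up without a core of the collection,
# assuming each core (shards x replication factor) lands on a different server.
# Usage: servers_without_core SERVERS SHARDS REPLICATION_FACTOR
servers_without_core() {
  cores=$(( $2 * $3 ))
  if [ "$cores" -ge "$1" ]; then
    echo 0
  else
    echo $(( $1 - cores ))
  fi
}

servers_without_core 3 1 2   # the example from the text, prints 1
```

Any result above zero means at least one server can answer queries for that collection with an error on versions affected by SOLR-4584.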

It takes some effort to set everything up but it’s definitely worth it. There are some problems around Solr, and the Cloud setup could be easier, but I’m convinced all of those issues will eventually be addressed. If you still have some capacity for more Solr knowledge, watch this talk: Solr 4: The SolrCloud Architecture.

