Playing with Apache Hive and SOLR

As described in a previous post, Apache SOLR performs very well at providing low-latency analytics. Log data can be pre-aggregated using Hive and then synced to SOLR. To this end, we developed a simple storage handler for SOLR so that data can be read from and written to SOLR transparently through a Hive external table.

In this post, we show how to install our SOLR storage handler and then run a simple example that syncs some data from Hive to SOLR.

We assume that you already have Hive installed (see our previous post on installing Hive) as well as SOLR (see our post on SOLR Cloud).

SOLR adapter installation

To install:

$ git clone http://github.com/chimpler/hive-solr
$ cd hive-solr
$ mvn package
$ cp target/hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar $HIVE_HOME/lib
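
If you prefer not to copy the jar into $HIVE_HOME/lib, you should also be able to register it for the current session only with Hive's ADD JAR command (the path below is simply where Maven left the artifact in our build):

$ hive
hive> ADD JAR target/hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar;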

SOLR example

Let’s use the SOLR schema provided in a previous post; you can get it with:

$ git clone http://github.com/chimpler/blog-solr-cloud-example

Copy the schema from the directory /etc/solr/core0 to your SOLR configuration and restart your web server:

<?xml version="1.0" ?>
<schema name="segment_overlap" version="1.1">
  <types>
<fieldType name="string"  class="solr.StrField" sortMissingLast="true"
              omitNorms="true"/>
   <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
              positionIncrementGap="0"/>
   <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
              positionIncrementGap="0"/>
   <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
              positionIncrementGap="0"/>
   <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"
              positionIncrementGap="0"/>
  </types>

 <fields>
   <field name="id" type="int" indexed="true" stored="true" required="true" />
   <field name="_version_" type="long" indexed="true" stored="true"
          required="true" />
   <field name="item_id" type="int" indexed="true" stored="true" required="true" />
   <field name="name" type="string" indexed="true" stored="true" required="true" />
   <field name="year" type="int" indexed="true" stored="true" required="true" />
   <field name="month" type="int" indexed="true" stored="true" required="true" />
   <field name="shipping_method" type="int" indexed="true" stored="true"
          required="true" />
   <field name="us_sold" type="int" indexed="true" stored="true" required="true" />
   <field name="ca_sold" type="int" indexed="true" stored="true" required="true" />
   <field name="fr_sold" type="int" indexed="true" stored="true" required="true" />
   <field name="uk_sold" type="int" indexed="true" stored="true" required="true" />
 </fields>

 <!-- field to use to determine and enforce document uniqueness. -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>name</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="AND"/>
</schema>
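
If you do not want to restart the whole web server, reloading the core through the standard CoreAdmin API should also pick up the new schema (assuming the default port and the core name core0):

$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0'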

Now let’s generate some data that we want to import into SOLR through Hive.

In the example directory of the repository, generate the data (tail -n +2 strips the CSV header line):

$ python gen-core0-data.py | tail -n +2 > data.csv

We assume that the core URL is: http://localhost:8983/solr/core0
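
As a quick check that the core is reachable (and, assuming it starts out empty, that numFound is 0 before the import), you can query it:

$ curl 'http://localhost:8983/solr/core0/select?q=*:*&rows=0&wt=json'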

Let’s import the data into Hive:

$ hive
hive> create table source (
    id INT,
    item_id INT,
    name STRING,
    year INT,
    month INT,
    shipping_method INT,
    us_sold INT,
    ca_sold INT,
    fr_sold INT,
    uk_sold INT)
    ROW FORMAT
    DELIMITED FIELDS TERMINATED BY ',';
OK

hive> LOAD DATA LOCAL INPATH 'data.csv' OVERWRITE INTO TABLE source;
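
As a quick sanity check that the load worked, you can look at a few rows and count them:

hive> select * from source limit 5;
hive> select count(*) from source;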

Let’s create a SOLR external table solr_items and copy the data from the table source into it:

hive> create external table solr_items (
    id INT,
    item_id INT,
    name STRING,
    year INT,
    month INT,
    shipping_method INT,
    us_sold INT,
    ca_sold INT,
    fr_sold INT,
    uk_sold INT
) stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,item_id,name,year,month,shipping_method,us_sold,ca_sold,fr_sold,uk_sold")
tblproperties ("solr.url" = "http://localhost:8983/solr/core0");
OK

hive> insert into table solr_items select * from source;
OK

You can optionally specify the buffer sizes to use when data is loaded from or stored into SOLR by passing the following parameters in tblproperties (a full example follows the list):

  • solr.buffer.input.rows: the number of rows to fetch in bulk when reading rows from SOLR. For instance: "solr.buffer.input.rows"="10000" (default is 100,000)
  • solr.buffer.output.rows: the number of rows to send in bulk when writing rows to SOLR. For instance: "solr.buffer.output.rows"="10000" (default is 100,000)
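
For instance, the solr_items table above could have been declared with explicit buffer sizes like this:

hive> create external table solr_items (
    id INT,
    item_id INT,
    name STRING,
    year INT,
    month INT,
    shipping_method INT,
    us_sold INT,
    ca_sold INT,
    fr_sold INT,
    uk_sold INT
) stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,item_id,name,year,month,shipping_method,us_sold,ca_sold,fr_sold,uk_sold")
tblproperties (
    "solr.url" = "http://localhost:8983/solr/core0",
    "solr.buffer.input.rows" = "10000",
    "solr.buffer.output.rows" = "10000");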

Now let’s check if the data is in SOLR:

$ curl 'http://localhost:8983/solr/core0/select?wt=csv' -d 'q=*:*'
id,_version_,shipping_method,fr_sold,uk_sold,ca_sold,item_id,name,us_sold,month,year
-48009990,1428822836035715072,0,1246,5050,796,0,item 0,376,1,1990
-48009989,1428822836035715073,1,6691,5718,2874,0,item 0,2029,1,1990
-48009988,1428822836035715074,2,5816,5114,5387,0,item 0,4927,1,1990
-48009987,1428822836035715075,3,8234,278,7206,0,item 0,1770,1,1990
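
Since the point of pushing the data into SOLR is low-latency analytics, you can also try a simple facet query, for example counting documents per year (standard SOLR faceting parameters):

$ curl 'http://localhost:8983/solr/core0/select?q=*:*&rows=0&facet=true&facet.field=year&wt=json'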


10 Responses to Playing with Apache Hive and SOLR

  1. Pankaj Khattar says:

    Hi

    This is good information regarding the integration of Hive and SOLR.

    I have a quick question, as I have not gone through the code on git.
    With the SOLR storage handler you describe here, are the Lucene-based index files created on the SOLR server as soon as the data is loaded into the Hive external table, or are the index files created in HDFS/Hive itself, with the SOLR webapp accessing them through the HDFS/Hive external table?

    Can you please describe the data flow?

    Thanks.

    • chimpler says:

      Hi Pankaj. The SOLR storage handler stores the data in SOLR as soon as you insert data into the Hive external table. In our case, SOLR does not have to be aware of Hive/HDFS at all.

      More precisely, when you insert data into the external table in Hive, say 1 million rows, we break it into small chunks of 100K rows and send them to SOLR. At the end of the insert, a commit is executed on SOLR to make it index the data.
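
      For reference, that final commit is equivalent to hitting the SOLR update handler yourself, for example:

      $ curl 'http://localhost:8983/solr/core0/update?commit=true'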

  2. How does the integration work for non-stored fields (stored="false")? Is it possible to query the field but not modify it? …btw this is great!

    • chimpler says:

      That's a good question. I would think that you would get null values for non-stored fields and would not be able to query them from Hive. However, you would be able to write to them so that they are indexed in SOLR (then, in SOLR, you would be able to query them but not read their values back).
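
      For example, a hypothetical field declared like this in the schema would be indexed (so searchable in SOLR) but not stored, and would come back as NULL when read through the Hive external table:

      <field name="description" type="string" indexed="true" stored="false" />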

  3. Bunny says:

    Hi, chimpler.
    I have been using your excellent jar on Hive 0.11 since last year.
    But after upgrading to Hive 0.12, it no longer works!
    Do you have plans to upgrade your library?

  4. Anupam says:

    Hi Chimpler,

    Your example gave me a quick start & initial understanding. Thanks!

    However, when I try to use Hive complex data types like ARRAY with your SolrStorageHandler class, I am facing errors. The basic example from your blog works perfectly with SOLR.

    It seems that in my case the type of the tags field in Hive is getting changed to String rather than ArrayList.

    I am basically trying to index a multivalued field in SOLR from a Hive table.
    Below is the schema definition from SOLR which is a multivalued field.

    Please see the following log:

    create table JOINED3(key STRING, author STRING, title STRING, tagname ARRAY<STRING>)
    row format delimited
    fields terminated by ','
    collection items terminated by '$';

    hive> describe JOINED3;
    OK
    key string
    author string
    title string
    tagname array<string>
    Time taken: 0.155 seconds

    create table JOINED5 (key STRING, author STRING, title STRING, tags ARRAY<STRING>)
    stored by "com.chimpler.hive.solr.SolrStorageHandler"
    with serdeproperties ("solr.column.mapping"="key,author,title,tags")
    tblproperties ("solr.url" = "http://localhost:8983/solr/collection");

    hive> describe JOINED5;
    OK
    key string from deserializer
    author string from deserializer
    title string from deserializer
    tags string from deserializer
    Time taken: 0.125 seconds

    hive> insert into table JOINED5 select * from JOINED3;
    FAILED: NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToString with (array<string>). Possible choices: _FUNC_(bigint) _FUNC_(binary) _FUNC_(boolean) _FUNC_(decimal) _FUNC_(double) _FUNC_(float) _FUNC_(int) _FUNC_(smallint) _FUNC_(string) _FUNC_(timestamp) _FUNC_(tinyint) _FUNC_(void)

    I would be grateful if you could help explain where I am going wrong.

    Regards
    Anupam

  5. Mahesh says:

    Hi Chimpler,

    I got the following error:
    Error: java.lang.RuntimeException: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.chimpler.hive.solr.SolrInputFormat

    Thanks

    • Pierre-Yves Dupont says:

      Hi,

      The same exception occurs when I try to fill solr_items with the following command:
      insert into table solr_items select * from source;

      I succeeded in:
      – packaging the jar with Maven and moving it into the Hive lib directory (I unzipped it to check that the SolrInputFormat class was present)
      – creating the Solr core (I can see it in the Solr web UI)
      – creating the source and solr_items tables
      – filling the source table (the data is returned correctly when executing a select)

      When executing the insert, the MapReduce job is launched but fails (in the map task, as the trace shows).

      Here is the trace:

      Total MapReduce jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks is set to 0 since there’s no reduce operator
      Starting Job = job_1409671122713_0009, Tracking URL = http://host1:8088/proxy/application_1409671122713_0009/
      Kill Command = /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop/bin/hadoop job -kill job_1409671122713_0009
      Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
      2014-09-04 14:30:35,430 Stage-0 map = 0%, reduce = 0%
      2014-09-04 14:31:00,217 Stage-0 map = 100%, reduce = 0%
      Ended Job = job_1409671122713_0009 with errors
      Error during job, obtaining debugging information…
      Examining task ID: task_1409671122713_0009_m_000000 (and more) from job job_1409671122713_0009

      Task with the most failures(4):
      —–
      Task ID:
      task_1409671122713_0009_m_000000

      URL:
      http://host1:8088/taskdetails.jsp?jobid=job_1409671122713_0009&tipid=task_1409671122713_0009_m_000000
      —–
      Diagnostic Messages for this Task:
      Error: java.lang.RuntimeException: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.chimpler.hive.solr.SolrInputFormat
      Serialization trace:
      inputFileFormatClass (org.apache.hadoop.hive.ql.plan.TableDesc)
      tableInfo (org.apache.hadoop.hive.ql.plan.FileSinkDesc)
      conf (org.apache.hadoop.hive.ql.exec.FileSinkOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
      aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
      at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:320)
      at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:259)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:251)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:378)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:371)
      at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:556)
      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
      Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.chimpler.hive.solr.SolrInputFormat
      Serialization trace:
      inputFileFormatClass (org.apache.hadoop.hive.ql.plan.TableDesc)
      tableInfo (org.apache.hadoop.hive.ql.plan.FileSinkDesc)
      conf (org.apache.hadoop.hive.ql.exec.FileSinkOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
      aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
      at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
      at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:238)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:226)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:745)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
      at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:829)
      at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:739)
      at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:752)
      at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:292)
      … 13 more
      Caused by: java.lang.ClassNotFoundException: com.chimpler.hive.solr.SolrInputFormat
      at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:270)
      at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
      … 49 more

      FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
      MapReduce Jobs Launched:
      Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
      Total MapReduce CPU Time Spent: 0 msec

      • Pierre-Yves Dupont says:

        Like Bunny, I use Hive 0.12.

      • Pierre-Yves Dupont says:

        I finally solved my problem. Chimpler's hive-solr.jar works fine on Hive 0.12.
        I hadn't copied the .jar into the right Hive library directory.
        For weird reasons, Hive throws a ClassNotFoundException on com.chimpler.hive.solr.SolrInputFormat when executing the insert command, but not when executing select and create statements.

        For Cloudera users like me: on each instance concerned, create a directory and copy chimpler's hive-solr jar into it. Then, in the Hive service configuration, set the "hive aux library directory" parameter to the path of the directory you just created.
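
        For example (the directory path is just illustrative), on each node:

        # create an auxiliary jar directory and copy the storage handler jar into it
        $ sudo mkdir -p /opt/hive-aux-jars
        $ sudo cp hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar /opt/hive-aux-jars/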
