Playing with Apache Hive and SOLR

As described in a previous post, Apache SOLR can perform very well to provide low latency analytics. Data logs can be pre-aggregated using Hive and then synced to SOLR. To this end, we developed a simple Storage Handler for SOLR so that data can be read and written to SOLR transparently using an external table.

We will show in this post how to install our SOLR storage handler and then run a simple example where we sync some data from Hive to SOLR.

We suppose in this post that you have Hive (see our previous post on installing Hive) and SOLR installed (see our post on SOLR Cloud).

SOLR adapter installation

To install:

$ git clone http://github.org/chimpler/hive-solr
$ cd hive-solr
$ mvn package
$ cp target/hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar $HIVE_HOME/lib

SOLR example

Let’s use the SOLR schema provided in a previous post at:

$ git clone http://github.com/chimpler/blog-solr-cloud-example

Copy the schema from the directory /etc/solr/core0 to your solr configuration and restart your webserver:

<?xml version="1.0" ?>
<schema name="segment_overlap" version="1.1">
  <types>
   <fieldtype name="string"  class="solr.StrField" sortMissingLast="true"
              omitNorms="true"/>
   <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
              positionIncrementGap="0"/>
   <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
              positionIncrementGap="0"/>
   <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
              positionIncrementGap="0"/>
   <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"
              positionIncrementGap="0"/>
  </types>

 <fields>
   <field name="id" type="int" indexed="true" stored="true" required="true" />
   <field name="_version_" type="long" indexed="true" stored="true"
          required="true" />
   <field name="item_id" type="int" indexed="true" stored="true" required="true" />
   <field name="name" type="string" indexed="true" stored="true" required="true" />
   <field name="year" type="int" indexed="true" stored="true" required="true" />
   <field name="month" type="int" indexed="true" stored="true" required="true" />
   <field name="shipping_method" type="int" indexed="true" stored="true"
          required="true" />
   <field name="us_sold" type="int" indexed="true" stored="true" required="true" />
   <field name="ca_sold" type="int" indexed="true" stored="true" required="true" />
   <field name="fr_sold" type="int" indexed="true" stored="true" required="true" />
   <field name="uk_sold" type="int" indexed="true" stored="true" required="true" />
 </fields>

 <!-- field to use to determine and enforce document uniqueness. -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>name</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="AND"/>
</schema>

Now let’s generate some data that we want to import in SOLR through Hive.

In the directory example:

$ python gen-core0-data.py | tail -n +2 > data.csv

We suppose that the core URL is: http://localhost:8983/solr/core0

Let’s import the data into Hive:

$ hive
hive> create table source (
    id INT,
    item_id INT,
    name STRING,
    year INT,
    month INT,
    shipping_method INT,
    us_sold INT,
    ca_sold INT,
    fr_sold INT,
    uk_sold INT)
    ROW FORMAT
    DELIMITED FIELDS TERMINATED BY ',';
OK

hive> LOAD DATA LOCAL INPATH 'data.csv' OVERWRITE INTO TABLE source;

Let’s create a SOLR external table solr_items and do a copy from the table source to solr_items:

hive> create external table solr_items (
    id INT,
    item_id INT,
    name STRING,
    year INT,
    month INT,
    shipping_method INT,
    us_sold INT,
    ca_sold INT,
    fr_sold INT,
    uk_sold INT
) stored by "com.chimpler.hive.solr.SolrStorageHandler"
with serdeproperties ("solr.column.mapping"="id,item_id,name,year,month,shipping_method,us_sold,ca_sold,fr_sold,uk_sold")
tblproperties ("solr.url" = "http://localhost:8983/solr/core0");
OK

hive> insert into table solr_items select * from source;
OK

You can optionally specify the buffer size to use when data is loaded or stored in SOLR by passing the following parameters in tblproperties:

  • solr.buffer.input.rows: number of rows to load in bulk when loading rows from SOLR. For instance: “solr.buffer.input.rows”=”10000” (default is 100,000)
  • solr.buffer.output.rows: number of rows to store in bulk when storing rows into SOLR. For instance: “solr.buffer.output.rows”=”10000” (default is 100,000)

Now let’s check if the data is in SOLR:

$ curl http://localhost:8983/solr/core0/select?wt=csv -d 'q=*:*'
id,_version_,shipping_method,fr_sold,uk_sold,ca_sold,item_id,name,us_sold,month,year
-48009990,1428822836035715072,0,1246,5050,796,0,item 0,376,1,1990
-48009989,1428822836035715073,1,6691,5718,2874,0,item 0,2029,1,1990
-48009988,1428822836035715074,2,5816,5114,5387,0,item 0,4927,1,1990
-48009987,1428822836035715075,3,8234,278,7206,0,item 0,1770,1,1990
Advertisements

About chimpler
http://www.chimpler.com

10 Responses to Playing with Apache Hive and SOLR

  1. Pankaj Khattar says:

    Hi

    This is a good information regarding the integration of Hive & SOLR.

    I have a quick question as i have not gone through the code which is there in git.
    The SOLR storage handler which you have mentioned here, Is it creating lucene based index files on the SOLR server as soon as the data is loaded in the hive external table or the index files are created in HDFS/Hive itself & SOLR webapp access them through HDFS/Hive external table?

    Can you please define the data flow?

    Thanks.

    • chimpler says:

      Hi Pankaj. The SOLR storage handler will store the data in SOLR as soon as you insert data in the hive external table. In our case, SOLR doesn’t have to be aware at all of Hive/HDFS.

      More precisely, when you insert data in the external table in Hive, say 1 million rows, we break them into small chunks of 100K rows and send them to SOLR. At the end of the insert, a commit is executed on SOLR to make it index the data.

  2. How the integrations works for non-stored fields (stored=”false”)? Is it possible to query the field but not modify it? …btw this is great!

    • chimpler says:

      That’s a good question. I would think that you would get null values for non-stored fields and won’t be able to query them. However you would be able to modify them so that they’re indexed in SOLR (then in SOLR you would be able to query them but not read their values).

  3. Bunny says:

    Hi, chimpler.
    I always used your excellent jar on hive 0.11 from last year.
    But after upgrading to hive 0.12, it can’t work!
    Do you have plan to upgrade your libs?

  4. Anupam says:

    Hi Chimpler,

    Your example gave me a quick start & initial understanding. Thanks!

    However when I am trying to utilize Hive Complex Data types like Array with your SolrStorageHandler Class I am facing errors. The basic example as per your blog works perfectly with SOLR.

    It seems that in my case the type of tags field in Hive is getting changed to String rather than ArrayList.

    I am basically trying to index a multivalued field in SOLR from a Hive table.
    Below is the schema definition from SOLR which is a multivalued field.

    Pls. see following log:

    create table JOINED3(key STRING, author STRING, title STRING, tagname array)
    row format delimited
    fields terminated by ‘,’
    collection items terminated by ‘$’;

    hive> describe JOINED3;
    OK
    key string
    author string
    title string
    tagname array
    Time taken: 0.155 seconds

    create table JOINED5 (key STRING, author STRING, title STRING, tags ARRAY)
    stored by “com.chimpler.hive.solr.SolrStorageHandler”
    with serdeproperties (“solr.column.mapping”=”key,author,title,tags”)
    tblproperties (“solr.url” = “http://localhost:8983/solr/collection”);

    hive> describe JOINED5;
    OK
    key string from deserializer
    author string from deserializer
    title string from deserializer
    tags string from deserializer
    Time taken: 0.125 seconds

    hive> insert into table JOINED5 select * from JOINED3;
    FAILED: NoMatchingMethodException No matching method for class org.apache.hadoop.hive.ql.udf.UDFToString with (array). Possible choices: _FUNC_(bigint) _FUNC_(binary) _FUNC_(boolean) _FUNC_(decimal) _FUNC_(double) _FUNC_(float) _FUNC_(int) _FUNC_(smallint) _FUNC_(string) _FUNC_(timestamp) _FUNC_(tinyint) _FUNC_(void)

    I will be grateful if you can help to explain where I am going wrong.

    Regards
    Anupam

  5. Mahesh says:

    Hi Chimpler,

    i got the following error:
    Error: java.lang.RuntimeException: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.chimpler.hive.solr.SolrInputFormat

    Thanks

    • Pierre-Yves Dupont says:

      Hi,

      The same exception occurs when I try to fill solr_items with the following command :
      insert into table solr_items select * from source;

      I succeded in :
      – package jar with maven and move it in hive libs (I unzip it in order to know if SolrInputFormat class was present)
      – create Solr Core (I see it in solr web IHM)
      – create source and solr_items tables
      – fill source table (data are well returned when executing a select)

      When executing insert, mapreduce job is launched but fails in reduce part.

      There’s the trace :

      Total MapReduce jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks is set to 0 since there’s no reduce operator
      Starting Job = job_1409671122713_0009, Tracking URL = http://host1:8088/proxy/application_1409671122713_0009/
      Kill Command = /opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/hadoop/bin/hadoop job -kill job_1409671122713_0009
      Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
      2014-09-04 14:30:35,430 Stage-0 map = 0%, reduce = 0%
      2014-09-04 14:31:00,217 Stage-0 map = 100%, reduce = 0%
      Ended Job = job_1409671122713_0009 with errors
      Error during job, obtaining debugging information…
      Examining task ID: task_1409671122713_0009_m_000000 (and more) from job job_1409671122713_0009

      Task with the most failures(4):
      —–
      Task ID:
      task_1409671122713_0009_m_000000

      URL:
      http://host1:8088/taskdetails.jsp?jobid=job_1409671122713_0009&tipid=task_1409671122713_0009_m_000000
      —–
      Diagnostic Messages for this Task:
      Error: java.lang.RuntimeException: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.chimpler.hive.solr.SolrInputFormat
      Serialization trace:
      inputFileFormatClass (org.apache.hadoop.hive.ql.plan.TableDesc)
      tableInfo (org.apache.hadoop.hive.ql.plan.FileSinkDesc)
      conf (org.apache.hadoop.hive.ql.exec.FileSinkOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
      aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
      at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:320)
      at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:259)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:251)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:378)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:371)
      at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:556)
      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:168)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
      Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.chimpler.hive.solr.SolrInputFormat
      Serialization trace:
      inputFileFormatClass (org.apache.hadoop.hive.ql.plan.TableDesc)
      tableInfo (org.apache.hadoop.hive.ql.plan.FileSinkDesc)
      conf (org.apache.hadoop.hive.ql.exec.FileSinkOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
      childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
      aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
      at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
      at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:238)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:226)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:745)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
      at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
      at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
      at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:829)
      at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:739)
      at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:752)
      at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:292)
      … 13 more
      Caused by: java.lang.ClassNotFoundException: com.chimpler.hive.solr.SolrInputFormat
      at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:270)
      at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
      … 49 more

      FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
      MapReduce Jobs Launched:
      Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
      Total MapReduce CPU Time Spent: 0 msec

      • Pierre-Yves Dupont says:

        like Bunny, I use hive 0.12.

      • Pierre-Yves Dupont says:

        I finally solve my problem. Chimpler’s hive-solr.jar works fine on hive 0.12.
        I didn’t pasted the .jar in the right hive library directory.
        For weird reasons, hive throws a ClassNotFoundException on com.chimpler.hive.solr.SolrInputFormat when executing insert command and not when executing select and create statements.

        For Cloudera users like me, be aware to create in each instance concerned a directory in which you will copy chimpler’s hive-solr jar. Then in Hive Service configuration, fill “hive aux library directory” parameter with the path of the directories you’ve just created.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: