
hive-solr's People

Contributors

acesar, ctargett, deansg, diegogutierrez, gerlowskija, wsolaligue


hive-solr's Issues

Support for Hive 2.2?

Hi, for various reasons I have to compile Hive 2.2 from the GitHub sources, which I have done. Now I want to use hive-solr to process my data. Does it support Hive 2.2? Thanks for your help.

Error when updating index

Hello,

We get this error when updating the index. We have defined the following Hive table and can query the index data, but not update it (a sketch of the kind of write involved follows the stack trace below). Any suggestions?

CREATE EXTERNAL TABLE hive_solr_test_1_zk (id string, field1 string)
      STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
      LOCATION '/tmp/solr'
      TBLPROPERTIES('solr.zkhost' = 'quickstart.cloudera:2181/solr',
                    'solr.collection' = 'hivesolrtest1',
                    'solr.query' = '*:*');
16/05/02 10:14:44 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1461934007258_0005/
16/05/02 10:14:44 INFO exec.Task: Starting Job = job_1461934007258_0005, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1461934007258_0005/
16/05/02 10:14:44 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1461934007258_0005
16/05/02 10:15:00 INFO exec.Task: Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
16/05/02 10:15:00 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/05/02 10:15:00 INFO exec.Task: 2016-05-02 10:15:00,436 Stage-0 map = 0%,  reduce = 0%
16/05/02 10:15:44 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/05/02 10:15:44 INFO exec.Task: 2016-05-02 10:15:44,311 Stage-0 map = 100%,  reduce = 0%
16/05/02 10:15:45 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/05/02 10:15:45 ERROR exec.Task: Ended Job = job_1461934007258_0005 with errors
16/05/02 10:15:45 INFO impl.YarnClientImpl: Killed application application_1461934007258_0005
16/05/02 10:15:45 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
16/05/02 10:15:45 INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1462176883595 end=1462176945435 duration=61840 from=org.apache.hadoop.hive.ql.Driver>
16/05/02 10:15:45 INFO ql.Driver: MapReduce Jobs Launched: 
16/05/02 10:15:45 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
16/05/02 10:15:45 INFO ql.Driver: Stage-Stage-0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
16/05/02 10:15:45 INFO ql.Driver: Total MapReduce CPU Time Spent: 0 msec
16/05/02 10:15:45 INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
16/05/02 10:15:45 INFO ZooKeeperHiveLockManager:  about to release lock for ocean_tests/hive_solr_test_1_zk2
16/05/02 10:15:45 INFO ZooKeeperHiveLockManager:  about to release lock for ocean_tests/dwh_kunde
16/05/02 10:15:45 INFO ZooKeeperHiveLockManager:  about to release lock for ocean_tests
16/05/02 10:15:45 INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1462176945444 end=1462176945452 duration=8 from=org.apache.hadoop.hive.ql.Driver>
16/05/02 10:15:45 ERROR operation.Operation: Error running hive query: 
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:147)
    at org.apache.hive.service.cli.operation.SQLOperation.access$000(SQLOperation.java:69)
    at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:502)
    at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:213)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
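
For context, writes against this table happen through Hive inserts; a minimal hedged sketch of the kind of statement that triggers the failure (the source table src and its columns are assumptions):

-- hedged sketch: "updating" the index means inserting through the external table
-- src is a hypothetical Hive table with matching columns (id string, field1 string)
INSERT OVERWRITE TABLE hive_solr_test_1_zk
SELECT id, field1 FROM src;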

multiValued

Hi sorry to ask a question here in the Issues section, but I'm at a loss as to where else to ask.

Does the solr-hive-serde integration support Solr multiValued field types? If so, is there a specific Hive datatype that needs to be used, i.e. will a Hive column of type array be converted to a properly configured multiValued field? (A sketch of this mapping follows the links below.)

If not, is there a workaround?

Hive - Complex Types
Solr - Defining Fields
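
Whether Hive arrays map to Solr multiValued fields is exactly the open question above; assuming they do, a hedged sketch of such a table (host, collection, and field names are hypothetical) would be:

-- hedged sketch: a Hive ARRAY column intended for a multiValued Solr field
-- assumes (unconfirmed) that the serde expands array elements into repeated values
CREATE EXTERNAL TABLE solr_multi_test (id STRING, tags ARRAY<STRING>)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.zkhost' = 'localhost:2181/solr',
              'solr.collection' = 'multi_test',
              'solr.query' = '*:*');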

The rows are garbled

Hello, when I read data from Solr the rows come back garbled (screenshot attached). Can you help me see what happened?

Unsupported major.minor 52.0

Hi, I'm having the following problem:
When I use the jar on Hive, the class com.lucidworks.hadoop.hive.LWStorageHandler throws a major.minor version error, so I changed my machine to JDK 1.8 and that class then works. But when I try to put some data into the Solr table, the class com/lucidworks/hadoop/hive/LWHiveInputFormat gives an Unsupported major.minor version 52.0 error, so either way some class fails.
I followed the README step by step.

Problems running on Hive 2.2

I tested hive-solr on Hive 1.2.1 and it works fine. But when testing on Hive 2.2 it shows:
" java.lang.NoClassDefFoundError: com/lucidworks/hadoop/hive/LWSerDe "

How can I solve this problem?

An error when importing data from Hive to Solr

When the number of fields is 34 the import succeeds; once the number is larger than 34 I get a connection-reset exception. (Hive and Solr run on two separate clusters.) As far as I know there is no built-in limit; the limit should be dictated by hardware resources.

Socket read timeout exception while committing to Solr

Hi,

I'm pushing 600 million records every day to Solr using Hive. Because the data volume is huge, Solr responds slowly,
and I'm getting a socket timeout exception (read timed out).

Can anyone tell me how to increase the socket timeout? Is it configurable? (A hedged sketch follows the stack trace below.)

Help is much appreciated. Thanks in advance.

Caused by: shaded.org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://{hostname}:8983/solr/{collection}
at shaded.org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:654)
at shaded.org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at shaded.org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at shaded.org.apache.solr.client.solrj.impl.LBHttpSolrClient.doRequest(LBHttpSolrClient.java:483)
at shaded.org.apache.solr.client.solrj.impl.LBHttpSolrClient.request(LBHttpSolrClient.java:413)
at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1106)
at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:886)
at shaded.org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:819)
at shaded.org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at shaded.org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:484)
at shaded.org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:501)
at com.lucidworks.hadoop.io.LucidWorksWriter.close(LucidWorksWriter.java:278)
... 23 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at shaded.org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at shaded.org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at shaded.org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
at shaded.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at shaded.org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at shaded.org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at shaded.org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at shaded.org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
at shaded.org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at shaded.org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at shaded.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at shaded.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at shaded.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at shaded.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at shaded.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at shaded.org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
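
The commit in LucidWorksWriter.close is where the timeout fires, so a longer client timeout is the natural knob. Neither the README nor this tracker names a timeout property, so the keys below are purely hypothetical placeholders; the real names (if any exist) must be confirmed in the serde source before use:

-- HYPOTHETICAL property names, shown only to illustrate the shape of a fix;
-- they are NOT confirmed to exist in the hive-solr serde
SET solr.client.socket.timeout=300000;     -- read timeout in ms (name is an assumption)
SET solr.client.connect.timeout=60000;     -- connect timeout in ms (name is an assumption)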

SemanticException Cannot find class "com.lucidworks.hadoop.hive.LWStorageHandler"

I am trying to do the following and I am getting "FAILED: SemanticException Cannot find class 'com.lucidworks.hadoop.hive.LWStorageHandler'"

create external table IF NOT EXISTS movies_solr2 (movieId INT, title STRING, genres STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/user/deep4u/solr/movielens_movies/'
TBLPROPERTIES('solr.zkhost' = 'data1.cloudwick.com:2181,data2.cloudwick.com.com:2181,data3.cloudwick.com:2181/solr',
              'solr.collection' = 'movies',
              'solr.query' = '*:*');

The jar I am using is solr-hive-serde-3.0.0.jar. Is this a known problem with the 3.0.0 jars? (A hedged check is sketched below.)

Any help would be greatly appreciated.
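
A SemanticException at CREATE TABLE time usually means the serde jar never reached the Hive classpath. A hedged first check (the jar path is an assumption) is:

-- make the serde jar visible to this Hive session before running the DDL
ADD JAR /path/to/solr-hive-serde-3.0.0.jar;
-- confirm it registered
LIST JARS;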

Collection is not getting created

I am new to Solr. I tried the example you gave, but the 'gettingstarted' collection is not created. It would be a great help if you could provide one detailed example.

Data is not importing from Hive to Solr

I am using Hive 2.1.1 (binary) and Solr 7.0. Importing data from HDFS into a Hive table works fine. We work with big data, so we need indexing to improve performance, and for that we use Solr; but importing data from Hive into Solr does not work. (A minimal working flow is sketched below for comparison.)

Hadoop version: 2.7
Hive version: 2.1.1
Solr version: 7.0
OS: CentOS 7
Java version: 1.8.0_144

Please help me with this issue.

Thanks in advance,
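
There is not enough detail above to diagnose, but for comparison, the minimal working flow that recurs elsewhere in this tracker looks like the following hedged sketch (jar path, ZooKeeper host, collection, and source table are all assumptions):

-- hedged end-to-end sketch assembled from examples elsewhere in this tracker
-- (jar path and ZooKeeper host are assumptions; the collection must already exist)
ADD JAR /path/to/solr-hive-serde-3.0.0.jar;

CREATE EXTERNAL TABLE hive_to_solr (id STRING, field1 STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.zkhost' = 'localhost:2181/solr',
              'solr.collection' = 'hive_to_solr',
              'solr.query' = '*:*');

-- indexing happens by inserting into the external table from a regular Hive table
INSERT OVERWRITE TABLE hive_to_solr SELECT id, field1 FROM some_source_table;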

Unable to insert data into hive-solr external table: Unable to create serializer "org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer" for class: com.lucidworks.hadoop.hive.LWHiveOutputFormat

Step 1) Created a Hive table named test.

Step 2) Created the external table as follows:
CREATE EXTERNAL TABLE rams.solr_test (
id string,
application_date string,
first_name string,
middle_name string,
last_name string,
preferred_email string,
address string,
city string,
country string,
candidate_disposition_reason string,
candidate_disposition string,
tag_original_ats string,
hiring_manager string,
job_title string,
is_candidate_active string,
attachments string)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/user/xxx/hiredesk'
TBLPROPERTIES('solr.server.url'='solr URL ',
'solr.zkhost' = 'zookeeper host name',
'solr.collection' = 'inc_hiredesk_docs',
'solr.query' = '*:*');

Note: I created a Solr collection ('inc_hiredesk_docs') and supplied a valid Solr server URL and ZooKeeper hostnames.

Step 3) Added the following jar files:

add jar /opt/lucidworks-hdpsearch/hive/solr-hive-serde-2.2.1.jar

add jar /opt/lucidworks-hdpsearch/hive/solr-hive-serde-0.13-2.2.1.jar

Step 4) Trying to insert a record, but getting the following error:

insert into rams.solr_test values("0001AA28-036A-40F8-A6E8-D0FA39044556","2013-11-12 22:07:16.510","Manish","test","Kumawat","[email protected]","test","Sikar","test","test","test","test","test","test","test","test")

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Error caching map.xml: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Unable to create serializer "org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer" for class: com.lucidworks.hadoop.hive.LWHiveOutputFormat
Serialization trace:
outputFileFormatClass (org.apache.hadoop.hive.ql.plan.TableDesc)
tableInfo (org.apache.hadoop.hive.ql.plan.FileSinkDesc)
conf (org.apache.hadoop.hive.ql.exec.FileSinkOperator)
childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
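
The failure happens while Hive serializes the query plan with Kryo, before any data moves. On the Hive 1.x line this serde targets, one hedged workaround was switching plan serialization to javaXML; the property value was removed in later Hive releases, so verify it against your version before relying on it:

-- hedged workaround: sidestep Kryo when it cannot serialize LWHiveOutputFormat
-- (hive.plan.serialization.format accepted kryo|javaXML on older Hive releases)
SET hive.plan.serialization.format=javaXML;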

Indexing more than 250 million rows from Hive to Solr

Hi all,

We are trying to index more than 250 million rows from a Hive table (ORC format), but we have noticed that the indexing is too slow.

We have 9 Solr nodes (9 shards and 2 replicas per shard), and we have set the maxIndexingThreads parameter to 128 and ramBufferSizeMB to 60 MB.

While the INSERT INTO on the external table (where the hive-serde is used) is running, the servers' CPUs are idle and the indexing throughput is lower than 1 million documents per hour.

Since the servers are idle, how can we make this faster? We have plenty of CPU and RAM but cannot bring them to bear on the indexing process.
Any suggestions? Are there client-side parameters that would let us use all the threads? (One hedged knob is sketched below.)
Thanks in advance.

PS: We have set the commits (auto and soft) to every 10 minutes or 1 million documents.
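
With the Solr side idle, the bottleneck is often the number of concurrent Hive writer tasks rather than anything in Solr. One hedged client-side knob, assuming MapReduce execution over splittable input, is shrinking the split size so more mappers (and therefore more parallel Solr writers) run:

-- hedged sketch: more mappers => more parallel indexing connections into Solr
-- 64 MB splits shown; the right value depends on your data (an assumption, not a recommendation)
SET mapreduce.input.fileinputformat.split.maxsize=67108864;
INSERT INTO TABLE my_solr_table SELECT * FROM my_orc_table;  -- hypothetical table names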

Not able to build

I am not able to build this project per the instructions in the README; I get the following exception:

C:\work\solr\hive-solr-master\solr-hive-core\src\main\java\com\lucidworks\hadoop\hive\LWHiveInputFormat.java:3: error: package com.lucidworks.hadoop.io does not exist
import com.lucidworks.hadoop.io.LWMapRedInputFormat;
^

How do I resolve "package com.lucidworks.hadoop.io does not exist"? Please help.

I am new to Gradle.

Build failure with compilation errors

Java version: "1.8.0_74"

When running './gradlew clean shadowJar --info' I see the errors below:

18:01:13.142 [ERROR] [system.err] return LWDocumentWritable.class;
18:01:13.142 [ERROR] [system.err] ^
18:01:13.143 [ERROR] [system.err] symbol: class LWDocumentWritable
18:01:13.143 [ERROR] [system.err] location: class LWSerDe
18:01:13.143 [ERROR] [system.err] /Users/sramaiah/git_workspace/hive-solr/solr-hive-core/src/main/java/com/lucidworks/hadoop/hive/LWSerDe.java:108: error: cannot find symbol
18:01:13.143 [ERROR] [system.err] LWDocument doc = LWDocumentProvider.createDocument();
18:01:13.143 [ERROR] [system.err] ^
18:01:13.144 [ERROR] [system.err] symbol: class LWDocument
18:01:13.144 [ERROR] [system.err] location: class LWSerDe
18:01:13.144 [ERROR] [system.err] /Users/sramaiah/git_workspace/hive-solr/solr-hive-core/src/main/java/com/lucidworks/hadoop/hive/LWSerDe.java:108: error: cannot find symbol
18:01:13.144 [ERROR] [system.err] LWDocument doc = LWDocumentProvider.createDocument();
18:01:13.144 [ERROR] [system.err] ^
18:01:13.144 [ERROR] [system.err] symbol: variable LWDocumentProvider
18:01:13.144 [ERROR] [system.err] location: class LWSerDe
18:01:13.146 [ERROR] [system.err] /Users/sramaiah/git_workspace/hive-solr/solr-hive-core/src/main/java/com/lucidworks/hadoop/hive/LWSerDe.java:170: error: cannot find symbol
18:01:13.146 [ERROR] [system.err] return new LWDocumentWritable(doc);
18:01:13.146 [ERROR] [system.err] ^
18:01:13.146 [ERROR] [system.err] symbol: class LWDocumentWritable
18:01:13.146 [ERROR] [system.err] location: class LWSerDe
18:01:13.147 [ERROR] [system.err] /Users/sramaiah/git_workspace/hive-solr/solr-hive-core/src/main/java/com/lucidworks/hadoop/hive/LWStorageHandler.java:27: error: incompatible types: Class cannot be converted to Class<? extends InputFormat>
18:01:13.147 [ERROR] [system.err] return LWHiveInputFormat.class;
18:01:13.147 [ERROR] [system.err] ^
18:01:13.148 [ERROR] [system.err] 89 errors
18:01:13.148 [ERROR] [system.err] 10 warnings
18:01:13.150 [DEBUG] [org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter] Finished executing task ':solr-hive-core:compileJava'
18:01:13.150 [LIFECYCLE] [class org.gradle.internal.buildevents.TaskExecutionLogger] :solr-hive-core:compileJava FAILED
18:01:13.150 [INFO] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] :solr-hive-core:compileJava (Thread[Daemon worker Thread 4,5,main]) completed. Took 0.928 secs.
18:01:13.150 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationWorkerRegistry] Worker root.13 completed (0 in use)
18:01:13.151 [DEBUG] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] Task worker [Thread[Daemon worker Thread 4,5,main]] finished, busy: 0.956 secs, idle: 0.008 secs
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build failed with an exception.
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] * What went wrong:
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] Execution failed for task ':solr-hive-core:compileJava'.
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] > Compilation failed; see the compiler error output for details.
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] * Try:
18:01:13.151 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] Run with --stacktrace option to get the stack trace.
18:01:13.151 [LIFECYCLE] [org.gradle.internal.buildevents.BuildResultLogger]
18:01:13.152 [LIFECYCLE] [org.gradle.internal.buildevents.BuildResultLogger] BUILD FAILED
18:01:13.152 [LIFECYCLE] [org.gradle.internal.buildevents.BuildResultLogger]
18:01:13.152 [LIFECYCLE] [org.gradle.internal.buildevents.BuildResultLogger] Total time: 2.051 secs
18:01:13.152 [DEBUG] [org.gradle.cache.internal.btree.BTreePersistentIndexedCache] Closing cache fileSnapshots.bin (/Users/sramaiah/

Not able to insert data into an external Hive table connected to Solr

Hi all,

Hive Version 1.2.1000
Solr Version 5.5.2.2.5
Hadoop Version 2.5.3.0
solr-hive-serde-3.0.0.jar

I wanted to implement an external Hive table connected to Solr.

I built the JAR file on my local PC and uploaded it to HDFS. After that I connected to our cluster using PuTTY, then connected to Hive using Beeline.

After connecting successfully, I loaded the jar file:

add jar hdfs:///user/hbx/solr-hive-serde-3.0.0.jar;

Then I created an empty external table, following the example from the documentation:

CREATE EXTERNAL TABLE solr_test (city STRING, edition INT, sport STRING, 
sub_sport STRING, athlete STRING, country STRING, gender STRING, event 
STRING, event_gender STRING, medal STRING) STORED BY 
'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' 
TBLPROPERTIES('solr.server.url' = 'https://hdpmst003.example.hbx:8080/', 
'solr.collection' = 'collection1', 'solr.query' = '*:*', 'lww.jaas.file' = 
'/etc/solr/conf/solr-server-jaas.conf');

I ensured that lww.jaas.file is present in the same location on all active nodes per the following doc:

"The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed)"

The table is created without errors, but when I try to insert data from another table into it, the query breaks down and we get various error messages.

The errors start during the map tasks; the beginning reads:

0: jdbc:hive2://hdpmst001.example.hbx:8080> INSERT OVERWRITE TABLE solr_test 
SELECT * FROM solr_olympics; 
INFO : Session is already open 
INFO : Dag name: INSERT OVERWRITE TABLE solr_...solr_olympics(Stage-1) 
INFO : 

INFO : Status: Running (Executing on YARN cluster with App id 
application_1492773261214_8070) 

INFO : Map 1: -/- 
INFO : Map 1: 0/1 
INFO : Map 1: 0(+1)/1 
INFO : Map 1: 0(+1,-1)/1 
INFO : Map 1: 0(+1,-1)/1 
INFO : Map 1: 0(+1,-2)/1 
INFO : Map 1: 0(+1,-3)/1 
INFO : Map 1: 0(+1,-3)/1 
INFO : Map 1: 0(+1,-3)/1 
ERROR : Status: Failed 
ERROR : Vertex failed, vertexName=Map 1, 

vertexId=vertex_1492773261214_8070_5_00, diagnostics=[Task failed, 
taskId=task_1492773261214_8070_5_00_000000, diagnostics=[TaskAttempt 0 
failed, info=[Error: Failure while running task:java.lang.RuntimeException: 
java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
processing row 

The end of the error message reads:

Caused by: java.lang.NullPointerException 
at com.lucidworks.hadoop.io.impl.LWSolrDocument.getId(LWSolrDocument.java:46) 
at com.lucidworks.hadoop.io.LucidWorksWriter.write(LucidWorksWriter.java:195) 
... 27 more 
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1492773261214_8070_6_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2) 

Any help in finding a thread to pull on would be greatly appreciated.
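
One thread worth pulling: the NPE is thrown from LWSolrDocument.getId, and the table above has no id column, while the working examples in this tracker all put one first. A hedged variant of the DDL, assuming the serde derives the Solr document key from an id column:

-- hedged sketch: the same table with an explicit id column for the document key
CREATE EXTERNAL TABLE solr_test (id STRING, city STRING, edition INT, sport STRING,
sub_sport STRING, athlete STRING, country STRING, gender STRING, event STRING,
event_gender STRING, medal STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.server.url' = 'https://hdpmst003.example.hbx:8080/',
'solr.collection' = 'collection1', 'solr.query' = '*:*',
'lww.jaas.file' = '/etc/solr/conf/solr-server-jaas.conf');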

Solr Index on Hive - Loading data into External table fails

Hello,
I created a Solr index on a Hive table with the steps below. When I try to load rows from the internal Hive table into the external Hive table, it fails. Please help.

  1. CREATE TABLE ER_ENTITY1000(entityid INT,claimid_s INT,firstname_s STRING,lastname_s STRING,addrline1_s STRING, addrline2_s STRING, city_s STRING, state_S STRING, country_s STRING, zipcode_s STRING, dob_s STRING, ssn_s STRING, dl_num_s STRING, proflic_s STRING, policynum_s STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

  2. LOAD DATA LOCAL INPATH '/home/Solr1.csv' OVERWRITE INTO TABLE ER_ENTITY1;

  3. add jar /home/solr-hive-serde-3.0.0.jar;

CREATE EXTERNAL TABLE SOLR_ENTITY999(entityid INT,claimid_s INT,firstname_s STRING,lastname_s STRING,ssn_s STRING,dl_num_s STRING,city_s STRING,state_s STRING,country_s STRING,zipcode_s STRING)
> STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
> LOCATION '/user/i98779/SOLR_ENTITY1'
> TBLPROPERTIES('solr.server.url' = 'http://10.52.192.108:8983/solr','solr.collection' = 'er_entity','solr.query' = '*:*');

********** All above steps work fine **********

  1. ********** This step fails **********
    INSERT OVERWRITE TABLE SOLR_ENTITY999 SELECT * FROM ER_ENTITY1000;

... With error:
hive> INSERT OVERWRITE TABLE SOLR_ENTITY999 SELECT * FROM ER_ENTITY1000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = i98779_20180308085142_3918b9ea-2158-4b0e-865f-2fcdefc17e4b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2018-03-08 08:51:45,993 Stage-1 map = 0%, reduce = 0%
Ended Job = job_local1283927429_0001 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: MAPRFS Read: 0 MAPRFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

********** ERROR FROM HIVE JOB LOG is as below **********
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.

Support case-insensitive field names from Solr

Today, mixed-case fields return null values instead.
For example:

<field name="id" type="string" indexed="true" stored="true"  />
<field name="clientId" type="string" indexed="true" stored="true"  required="true"/>
<field name="ownerId" type="string" indexed="true" stored="true" required="true"/>

Using the following query:

SELECT id, clientId, ownerId from my_solr_table limit 1

will return the values:

  • abcefg
  • null
  • null

Cannot query indexes with null field values

I can select into my hive-solr table from a source table to create the Solr index; however, when I query the resulting table, I receive an error on columns that were null on input:

With select *:

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double

When selecting up to the offending column:

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException

Error loading into solr table from another hive table.

>sudo -u solr bin/solr create -c hiveCollection -d basic_configs -n hiveCollection -s 2 -rf 2
>hive>CREATE EXTERNAL TABLE authproc_syslog_solr (hid STRING, tstamp TIMESTAMP, type STRING, msg STRING, thost STRING, tservice STRING, tyear STRING, tmonth STRING, tday STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.zkhost' = 'hadoop1.openstacksetup.com:2181/solr', 'solr.collection'='hiveCollection', 'solr.query' = '*:*');

>hive>INSERT OVERWRITE TABLE authproc_syslog_solr SELECT s.* FROM authproc_syslog s;

Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:32, Vertex vertex_1473357519389_0194_6_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0

DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_14733575
19389_0194_6_00, diagnostics=[Task failed, taskId=task_1473357519389_0194_6_00_000009, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure wh
ile running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error whi
le processing row
Caused by: java.lang.NullPointerException
        at com.lucidworks.hadoop.io.impl.LWSolrDocument.getId(LWSolrDocument.java:46)
        at com.lucidworks.hadoop.io.LucidWorksWriter.write(LucidWorksWriter.java:184)
        at com.lucidworks.hadoop.hive.LWHiveOutputFormat$1.write(LWHiveOutputFormat.java:39)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:764)
        at org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:102)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
        at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:138)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:133)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:170)
        at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)

The Hive table and the hive_solr table have exactly the same schema.
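
The stack trace again ends in LWSolrDocument.getId, as in the JAAS report above, and this table names its key hid rather than id. A hedged experiment, assuming the serde looks for a column literally named id:

-- hedged sketch: expose the key under the column name id
CREATE EXTERNAL TABLE authproc_syslog_solr2 (id STRING, tstamp TIMESTAMP, type STRING,
msg STRING, thost STRING, tservice STRING, tyear STRING, tmonth STRING, tday STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.zkhost' = 'hadoop1.openstacksetup.com:2181/solr',
'solr.collection' = 'hiveCollection', 'solr.query' = '*:*');

INSERT OVERWRITE TABLE authproc_syslog_solr2 SELECT s.hid, s.tstamp, s.type, s.msg,
s.thost, s.tservice, s.tyear, s.tmonth, s.tday FROM authproc_syslog s;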

Support for Solr basic authentication

Hi all,

I am trying to use the hive_1x branch and have built solr-hive-serde-3.0.0.jar based on your instructions. I was also able to add this jar to my Hive runtime and create an external table based on the docs here: https://doc.lucidworks.com/fusion/2.4/Importing_Data/Import-via-Hive.html.

However, when I try to insert data into the Hive-Solr table, I get the following:

Caused by: shaded.org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://example.com:8983/solr/hive_solr_test: Expected mime type application/octet-stream but got text/html. <html>

<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 require authentication</title>
</head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /solr/hive_solr_test/update. Reason:
<pre>    require authentication</pre></p>
</body>
</html>

Obviously this is due to improper authentication. But throughout the README I only saw instructions on how to authenticate through Kerberos, while our SolrCloud enforces basic username/password authentication. Could you provide some insight on how I can apply basic authentication in the hive-solr driver?

Thanks,
Vincent

Using Tez to index data from Hive to Solr gives exceptions

I am trying to index large amounts of data from Hive to Solr, and I would like to use Tez, as it is faster. When I run a query to insert data into the Solr-managed external table, it works, but it takes a very long time and eventually times out on the Hive side.

With Tez, I get errors whenever I try to do anything with the Solr-managed table, like select count(*) or inserting records. I thought it was a file-permission issue, so I changed the permissions on the Solr collection folder so that anyone in the solr user's group could insert into the table, but that did not work either.

Is this a known issue with the hive-solr plugin, or am I missing something? (A hedged workaround is sketched below.)

Thanks for your help.
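
Until the Tez path is understood, a common hedged workaround is forcing the MapReduce engine for statements that touch the Solr-backed table; slower, but it exercises the code path the serde is known to work with:

-- hedged workaround: run this session on MapReduce instead of Tez
SET hive.execution.engine=mr;
SELECT COUNT(*) FROM my_solr_table;  -- hypothetical table name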

Cannot query timestamp type after indexed

The source Hive table has columns of type timestamp. I mapped these to a hive-solr column of type timestamp and a schema field of type tdate (which has precision 6). Upon querying, I received this error:

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.util.Date cannot be cast to java.sql.Timestamp

Value is stored as TrieDateField with correct precision. Shouldn't the SerDe treat this as a timestamp (and precision 0 as a date)?
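
Until the serde's cast is fixed, one hedged workaround is declaring the column as STRING in the external table and converting at query time (table and column names here are assumptions):

-- hedged workaround: read the tdate field as text to avoid the Date -> Timestamp cast
CREATE EXTERNAL TABLE solr_events (id STRING, created_at STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.zkhost' = 'localhost:2181/solr',
'solr.collection' = 'events', 'solr.query' = '*:*');

-- convert on the way out; Solr returns ISO-8601 date strings
SELECT id,
       CAST(from_unixtime(unix_timestamp(created_at, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS TIMESTAMP)
FROM solr_events;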

Not able to build the jar using Java 1.7

My cluster is built on Java 1.7. After exporting JAVA_HOME I am not able to build the jar file; it fails with the exception below. Please let me know whether this is compatible with Java 1.7.

Error:
Received result Failure[value=org.gradle.initialization.ReportedException: org.gradle.internal.exceptions.LocationAwareException: Execution failed for task ':solr-hadoop-common:solr-hadoop-io:compileJava'.] from daemon DaemonInfo{pid=10063, address=[c6420d07-8233-4272-9515-ea571d8e1e75 port:40778, addresses:[/127.0.0.1]], idle=false, lastBusy=1501864598134, context=DefaultDaemonContext[uid=d6b0e1b2-e9d5-4ade-bae6-64c2c1c46f0b,javaHome=/local/java/jdk1.7.0_80,daemonRegistryDir=/home/admin_toopras/.gradle/daemon,pid=10063,idleTimeout=10800000,daemonOpts=-XX:MaxPermSize=256m,-XX:+HeapDumpOnOutOfMemoryError,-Xmx1024m,-Dfile.encoding=UTF-8,-Duser.country=US,-Duser.language=en,-Duser.variant]} (build should be done).
hive-solr.txt

Dependency conflict with newer Hive versions/Serde 2.3

Hey guys, this is a problem report and a correction.

I just faced an issue with newer versions of Hive/SerDe (especially on AWS EMR).

The org.apache.hadoop.hive.serde2.SerDe interface is now deprecated and no longer present in the hive-serde jar (as of version 2.3). The fix is easy: just change org.apache.hadoop.hive.serde2.SerDe to org.apache.hadoop.hive.serde2.AbstractSerDe in the imports of FusionStorageHandler.java and LWStorageHandler.java, and extend it in LWSerDe.java.

Then recompile.

Cannot retrieve data from a solr indexed external table

I am able to load data into the Solr external table from another managed Hive table.
I can see that the data got indexed, and I can search it using the HUE interface.
But when I try to retrieve data from the Solr table, it throws
"Failed with exception java.io.IOException:java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String".
I am using solr-hive-serde-2.2.6.jar (with com.lucidworks.hadoop.hive.LWStorageHandler) on Hive 1.1.0-cdh5.4.5. (A hedged workaround is sketched below.)
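
A hedged guess at the mismatch: the stored Solr field holds integers while the Hive column is declared STRING (or the reverse). Aligning the external table's column types with what Solr actually returns is worth trying; the names below are assumptions:

-- hedged sketch: declare the column with the type Solr actually returns
CREATE EXTERNAL TABLE solr_readback (id STRING, count_i INT)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.zkhost' = 'localhost:2181/solr',
'solr.collection' = 'readback', 'solr.query' = '*:*');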

Make the name of id field configurable

It should be fairly easy to support configuring the name of the id field in the given Solr collection, and it could be useful. The pull request submitted by saurabhnigam is a good place to start.

Error restarting Solr after create the collection to index hive table

I followed your example to create the Solr table and collection, but now when I try to restart the Solr server I get this error:

java.io.IOException: Path /conf does not exist
at org.apache.solr.common.cloud.ZkConfigManager.uploadToZK(ZkConfigManager.java:56)
at org.apache.solr.common.cloud.ZkConfigManager.uploadConfigDir(ZkConfigManager.java:120)
at org.apache.solr.cloud.ZkController.bootstrapConf(ZkController.java:1762)
at org.apache.solr.cloud.ZkCLI.main(ZkCLI.java:194)

It seems that creating the collection and table does not create the conf folder automatically.

Query params not pushed to solr

I have followed the README and created an external table exactly as described, plus loaded my data into Solr. I can query my data collection from the browser with q=item_name:apple and it returns results as expected.

However when I run the same query from hive:

select * from item_solr where item_name = 'apple'

it returns nothing. Is the syntax correct? How do you get hive-solr to tell Solr to use q=item_name:apple?

I also tried this query, which returns all rows matching the wildcard:
select * from item_solr WHERE lower(item_name) RLIKE '.*apple.*';

but I can see in the Solr logs that hive-solr doesn't push the query params down:
2017-08-18 00:33:40.934 INFO (qtp401424608-2599) [ x:item_wild] o.a.s.c.S.Request [item_wild] webapp=/solr path=/select params={q=*:*&distrib=false&cursorMark=AoEvMjA0MjI3LTEwNjI2NTUw&start=0&collection=item_wild&sort=id+asc&rows=1000&wt=javabin&version=2} hits=422407 status=0 QTime=19

I assumed that hive-solr picks up the predicates in the WHERE clause when solr.query='*:*' is specified and puts them in the Solr query; am I mistaken? (A hedged workaround is sketched after the table definition below.)
Similar to this:
https://github.com/vroyer/hive-solr-search

My table definition:
hive> show create table item_solr;
OK
CREATE EXTERNAL TABLE item_solr(
  id string COMMENT 'from deserializer',
  item_name string COMMENT 'from deserializer')
ROW FORMAT SERDE 'com.lucidworks.hadoop.hive.LWSerDe'
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
WITH SERDEPROPERTIES ('serialization.format'='1')
LOCATION 'hdfs://nameservice1/user/solr'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'lww.jaas.file'='/home/user/jaas-client.conf',
  'numFiles'='4',
  'numRows'='-1',
  'rawDataSize'='-1',
  'solr.collection'='item_wild',
  'solr.query'='*:*',
  'solr.server.url'='http://localhost:11119/solr',
  'totalSize'='717879',
  'transient_lastDdlTime'='1496303975')

thank you
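
One hedged workaround while pushdown remains unclear: put the Solr query itself into the table properties so Solr does the filtering regardless of what Hive pushes down. The ALTER TABLE syntax is standard Hive; whether the serde re-reads the property per query is an assumption to verify:

-- hedged workaround: ask Solr for the subset directly instead of relying on pushdown
ALTER TABLE item_solr SET TBLPROPERTIES ('solr.query' = 'item_name:apple');
SELECT * FROM item_solr;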

Indexing to a TRA (Time Routed Alias) is not supported?

I've been using hive-solr to index my data into Solr collections without trouble. Last week I wanted to try Solr's Time Routed Alias feature, which behaved as expected when I manually indexed a few test documents. I then prepared some data to index into a TRA, but hive-solr didn't seem to see the alias; it behaved as though I were indexing into a non-existent collection. Does hive-solr support the TRA feature at all?

Error accessing Solr with Kerberos

Hi,
We created an external table accessing Solr (with Kerberos active) using this script:

CREATE EXTERNAL TABLE job_solr (gms STRING, cliente STRING, job STRING,job_descripcion STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/apps/hive/warehouse/datos/jobs_solr'
TBLPROPERTIES('solr.zkhost' = 'xxxxxxxx:2181/solr',
'solr.collection' = 'MSSearch1',
'solr.query' = '*:*',
'lww.jaas.file' = '/app/lucicworks/hive-solr/hive_solr_jaas.conf',
'lww.jaas.appname' = 'Client');

When we try to access the table from Hive, it raises an error as if it were unable to authenticate with Kerberos:

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: shaded.org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://xxxxxxxxxxxxxxx:8983/solr/MSSearch1_shard1_replica2: Expected mime type application/octet-stream but got text/html.

<title>Error 401 Authentication required</title>

HTTP ERROR 401

Problem accessing /solr/M..

What could be the problem?
Regards
