aaronshan / hive-third-functions
Some useful custom Hive UDF functions, especially array, json, math, and string functions.
License: Apache License 2.0
create table temp.test_explode (userid string, log string) partitioned by (day string) stored as orc;
insert into table temp.test_explode partition (day = '2017-11-01') values ('u1', '[{ "action": "a" }, { "action": "b"} ]');
insert into table temp.test_explode partition (day = '2017-11-02') values ('u2', '[{ "action": "a" , "arg1": "1"}, { "action": "b", "arg2": "2"} ]');
create temporary function udf_json_array_extract as 'cc.shanruifeng.functions.json.UDFJsonArrayExtract';
create temporary function udf_json_array_extract_scalar as 'cc.shanruifeng.functions.json.UDFJsonExtractScalar';
select userid, udf_json_array_extract ( log, '$.action' ) from temp.test_explode where day = '2017-11-01';
-- result
u1 ["\"a\"","\"b\""]
select userid, udf_json_array_extract_scalar ( log, '$.action' ) from temp.test_explode where day = '2017-11-01';
-- result
u1 NULL
I want to get an array like ["a","b"], but in the second query above I got NULL.
Am I using the wrong syntax for the JSON path? What is the correct syntax for the path?
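One plausible reading (an assumption, based on the Presto-style semantics these UDFs appear to mirror): udf_json_array_extract returns each match as JSON-encoded text, which is why the result shows "\"a\"" rather than a, while the scalar variant expects the path to resolve to a single scalar and returns NULL otherwise. A minimal stdlib sketch of a hypothetical post-processing helper that strips the JSON string quoting from the extracted elements:

```java
public class JsonUnquote {
    // Hypothetical helper: elements returned by udf_json_array_extract are
    // JSON-encoded text such as "\"a\"". Stripping the surrounding quotes
    // yields the plain scalar value ["a","b"] the question asks for.
    static String unquoteJson(String jsonText) {
        if (jsonText != null && jsonText.length() >= 2
                && jsonText.startsWith("\"") && jsonText.endsWith("\"")) {
            return jsonText.substring(1, jsonText.length() - 1);
        }
        return jsonText; // non-string JSON values (numbers, objects) pass through
    }

    public static void main(String[] args) {
        System.out.println(unquoteJson("\"a\"")); // prints: a
    }
}
```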
Hello,
How do I install this on remote Hadoop nodes from a client?
0: jdbc:hive2://XX> create temporary function wgs_distance as 'com.github.aaronshan.functions.geo.UDFGeoWgsDistance';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoWgsDistance does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function gcj_to_bd as 'com.github.aaronshan.functions.geo.UDFGeoGcjToBd';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoGcjToBd does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function bd_to_gcj as 'com.github.aaronshan.functions.geo.UDFGeoBdToGcj';
No rows affected (0.014 seconds)
0: jdbc:hive2://XX> create temporary function wgs_to_gcj as 'com.github.aaronshan.functions.geo.UDFGeoWgsToGcj';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoWgsToGcj does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function gcj_to_wgs as 'com.github.aaronshan.functions.geo.UDFGeoGcjToWgs';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoGcjToWgs does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function gcj_extract_wgs as 'com.github.aaronshan.functions.geo.UDFGeoGcjExtractWgs';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoGcjExtractWgs does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
The jar comes from your 2.2.0 release; is there some problem with it?
Argument check:
if (arguments.length != 2 || arguments.length != 3)
should be changed to:
if (arguments.length != 2 && arguments.length != 3)
Return value:
list.add(searchedGroup.toString());
should be changed to:
list.add(searchedGroup.toStringUtf8());
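The first fix follows from De Morgan's laws: no argument count can equal both 2 and 3, so `n != 2 || n != 3` is true for every n and the check always fires, rejecting even valid calls. A minimal sketch of the two conditions side by side:

```java
public class ArgCheck {
    // Buggy condition: true for EVERY n (nothing equals both 2 and 3),
    // so the length check rejected valid 2- and 3-argument calls too.
    static boolean buggy(int n) {
        return n != 2 || n != 3;
    }

    // Fixed condition: true only when n is neither 2 nor 3.
    static boolean invalid(int n) {
        return n != 2 && n != 3;
    }
}
```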
From the README, the two functions are not very different. What other advantages does json_extract_scalar() have?
https://github.com/aaronshan/hive-third-functions/blame/f98fef86d328882c85ea40b69b14375d90d44201/src/main/java/com/github/aaronshan/functions/array/UDFArrayIntersect.java#L160
select default.array_intersect(array("39236600","38943350","39007633"),array("39236600","38943350","39007633","38593565","39165420","39119191","39223090","39273131","39113697","39264583","38643724","39243639","39273301","39153039","39152750","38422867","39194210"));
The result should be {"39236600","38943350","39007633"}, but only one element was actually returned.
Looking at the code, the compare method's logic is wrong; it should be fixed as follows:
private int compare(ListObjectInspector arrayOI, Object array, int[] positions, int position1, int position2) {
    ObjectInspector arrayElementOI = arrayOI.getListElementObjectInspector();
    Object arrayElementTmp1 = arrayOI.getListElement(array, positions[position1]);
    Object arrayElementTmp2 = arrayOI.getListElement(array, positions[position2]);
    return ObjectInspectorUtils.compare(arrayElementTmp1, arrayElementOI, arrayElementTmp2, arrayElementOI);
}
That is, add an int[] positions parameter and pass in leftPositions or rightPositions; the call sites of this method need to be updated accordingly.
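The essence of this fix is that the comparison must go through the positions index array: compare the elements that positions[i] and positions[j] point at, not the elements at i and j directly. A standalone sketch with plain ints (names illustrative, not the UDF's actual types):

```java
public class PositionCompare {
    // Position-indirected comparison: positions[] maps logical sort slots
    // to physical element indices, so we must dereference through it.
    // Comparing array[i] to array[j] directly was the reported bug.
    static int compareViaPositions(int[] array, int[] positions, int i, int j) {
        return Integer.compare(array[positions[i]], array[positions[j]]);
    }
}
```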
array_intersect has an array index out-of-bounds bug, and the code is also complicated. Simplified code:
`@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object leftArray = arguments[0].get();
    Object rightArray = arguments[1].get();
    // Check if either array is null or empty
    if (leftArray == null || rightArray == null) {
        return null;
    }
    int leftArrayLength = leftArrayOI.getListLength(leftArray);
    int rightArrayLength = rightArrayOI.getListLength(rightArray);
    if (leftArrayLength < 0 || rightArrayLength < 0) {
        return null;
    }
    if (leftArrayLength == 0) {
        return leftArray;
    }
    if (rightArrayLength == 0) {
        return rightArray;
    }
    List<?> leftList = leftArrayOI.getList(leftArray);
    List<?> rightList = rightArrayOI.getList(rightArray);
    HashSet<?> resultSet = Sets.newHashSet(leftList);
    resultSet.retainAll(rightList);
    return new ArrayList<>(resultSet);
}`
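The core of the simplification is plain java.util set intersection, with no index bookkeeping to get out of bounds. A self-contained sketch of the same logic over Strings, outside the ObjectInspector machinery:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IntersectSketch {
    // Deduplicate the left array into a set, then keep only elements also
    // present on the right -- the HashSet + retainAll core of evaluate().
    // LinkedHashSet preserves the left array's encounter order.
    static List<String> intersect(List<String> left, List<String> right) {
        Set<String> result = new LinkedHashSet<>(left);
        result.retainAll(right);
        return new ArrayList<>(result);
    }
}
```

Under this approach the query above returns all three common elements, not just one.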
Hello,
I tried to apply these UDFs in spark with hive support. Here is the code:
// register UDF
spark.sql("create temporary function id_card_province as 'cc.shanruifeng.functions.card.UDFChinaIdCardProvince'");
// get file
Dataset<Row> rawdata = spark.read().csv("./src/main/resources/starM.csv");
// use UDF
rawdata.createOrReplaceTempView("starM");
Dataset<Row> udfModified = spark.sql("SELECT *, id_card_province(_c13) FROM starM");
udfModified.show();
and I got error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text cc.shanruifeng.functions.card.UDFChinaIdCardProvince.evaluate(org.apache.hadoop.io.Text) on object cc.shanruifeng.functions.card.UDFChinaIdCardProvince@5c622859 of class cc.shanruifeng.functions.card.UDFChinaIdCardProvince with arguments {652423184510291234:org.apache.hadoop.io.Text} of size 1
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:981)
at org.apache.spark.sql.hive.HiveSimpleUDF.eval(hiveUDFs.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_6$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:957)
... 18 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.Text.encode(Text.java:450)
at org.apache.hadoop.io.Text.set(Text.java:198)
at cc.shanruifeng.functions.card.UDFChinaIdCardProvince.evaluate(UDFChinaIdCardProvince.java:23)
... 23 more
Could you please give me some advice on this problem? The column is not null anyway.
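The NullPointerException at Text.encode suggests a null reached Text.set(...) inside evaluate, which can happen even when the source column looks non-null, for example when the UDF's internal province lookup fails for a given value and the result being wrapped is null. A hedged sketch of the usual defensive pattern, written with plain Strings since the real method takes org.apache.hadoop.io.Text (the two-digit-prefix lookup and map name are assumptions for illustration):

```java
import java.util.Map;

public class NullGuardSketch {
    // Defensive evaluate pattern: return null instead of passing null to
    // Text.set(), which throws the NPE seen in the stack trace.
    static String evaluateSafely(String idCard, Map<String, String> provinceByPrefix) {
        if (idCard == null || idCard.length() < 2) {
            return null;
        }
        // Hypothetical lookup keyed on the card's first two digits; an
        // unknown prefix must also yield null rather than an exception.
        return provinceByPrefix.get(idCard.substring(0, 2));
    }
}
```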