aaronshan / hive-third-functions
Some useful custom Hive UDF functions, especially array, json, math, and string functions.
License: Apache License 2.0
create table temp.test_explode (userid string, log string) partitioned by (day string) stored as orc;
insert into table temp.test_explode partition (day = '2017-11-01') values ('u1', '[{ "action": "a" }, { "action": "b"} ]');
insert into table temp.test_explode partition (day = '2017-11-02') values ('u2', '[{ "action": "a" , "arg1": "1"}, { "action": "b", "arg2": "2"} ]');
create temporary function udf_json_array_extract as 'cc.shanruifeng.functions.json.UDFJsonArrayExtract';
create temporary function udf_json_array_extract_scalar as 'cc.shanruifeng.functions.json.UDFJsonExtractScalar';
select userid, udf_json_array_extract ( log, '$.action' ) from temp.test_explode where day = '2017-11-01';
-- result
u1 ["\"a\"","\"b\""]
select userid, udf_json_array_extract_scalar ( log, '$.action' ) from temp.test_explode where day = '2017-11-01';
-- result
u1 NULL
I want to get an array like ["a","b"], but in the second query above I got NULL.
Am I using the wrong syntax for the JSON path? What is the correct syntax for the path?
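One plausible reading (an assumption, based on the Presto-style semantics these UDFs appear to mirror): udf_json_array_extract returns each match as JSON-encoded text, which is why the result shows "\"a\"" rather than a, while the scalar variant expects the path to resolve to a single scalar and returns NULL otherwise. A minimal stdlib sketch of a hypothetical post-processing helper that strips the JSON string quoting from the extracted elements:

```java
public class JsonUnquote {
    // Hypothetical helper: elements returned by udf_json_array_extract are
    // JSON-encoded text such as "\"a\"". Stripping the surrounding quotes
    // yields the plain scalar value ["a","b"] the question asks for.
    static String unquoteJson(String jsonText) {
        if (jsonText != null && jsonText.length() >= 2
                && jsonText.startsWith("\"") && jsonText.endsWith("\"")) {
            return jsonText.substring(1, jsonText.length() - 1);
        }
        return jsonText; // non-string JSON values (numbers, objects) pass through
    }

    public static void main(String[] args) {
        System.out.println(unquoteJson("\"a\"")); // prints: a
    }
}
```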
Hello,
How do I install this on remote Hadoop nodes from a client?
0: jdbc:hive2://XX> create temporary function wgs_distance as 'com.github.aaronshan.functions.geo.UDFGeoWgsDistance';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoWgsDistance does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function gcj_to_bd as 'com.github.aaronshan.functions.geo.UDFGeoGcjToBd';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoGcjToBd does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function bd_to_gcj as 'com.github.aaronshan.functions.geo.UDFGeoBdToGcj';
No rows affected (0.014 seconds)
0: jdbc:hive2://XX> create temporary function wgs_to_gcj as 'com.github.aaronshan.functions.geo.UDFGeoWgsToGcj';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoWgsToGcj does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function gcj_to_wgs as 'com.github.aaronshan.functions.geo.UDFGeoGcjToWgs';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoGcjToWgs does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
0: jdbc:hive2://XX> create temporary function gcj_extract_wgs as 'com.github.aaronshan.functions.geo.UDFGeoGcjExtractWgs';
ERROR : FAILED: Class com.github.aaronshan.functions.geo.UDFGeoGcjExtractWgs does not implement UDF, GenericUDF, or UDAF
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask (state=08S01,code=1)
The jar comes from your 2.2.0 release; is there some problem with it?
Argument check:
if (arguments.length != 2 || arguments.length != 3)
should be changed to:
if (arguments.length != 2 && arguments.length != 3)
Return value:
list.add(searchedGroup.toString());
should be changed to:
list.add(searchedGroup.toStringUtf8());
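The first fix follows from De Morgan's laws: no argument count can equal both 2 and 3, so `n != 2 || n != 3` is true for every n and the check always fires, rejecting even valid calls. A minimal sketch of the two conditions side by side:

```java
public class ArgCheck {
    // Buggy condition: true for EVERY n (nothing equals both 2 and 3),
    // so the length check rejected valid 2- and 3-argument calls too.
    static boolean buggy(int n) {
        return n != 2 || n != 3;
    }

    // Fixed condition: true only when n is neither 2 nor 3.
    static boolean invalid(int n) {
        return n != 2 && n != 3;
    }
}
```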
From the README, the two functions are not very different. What other advantages does json_extract_scalar() have?
https://github.com/aaronshan/hive-third-functions/blame/f98fef86d328882c85ea40b69b14375d90d44201/src/main/java/com/github/aaronshan/functions/array/UDFArrayIntersect.java#L160
select default.array_intersect(array("39236600","38943350","39007633"),array("39236600","38943350","39007633","38593565","39165420","39119191","39223090","39273131","39113697","39264583","38643724","39243639","39273301","39153039","39152750","38422867","39194210"));
The result should be {"39236600","38943350","39007633"}, but only one element was actually returned.
Looking at the code, the compare method's logic is wrong; it should be fixed as follows:
private int compare(ListObjectInspector arrayOI, Object array, int[] positions, int position1, int position2) {
    ObjectInspector arrayElementOI = arrayOI.getListElementObjectInspector();
    Object arrayElementTmp1 = arrayOI.getListElement(array, positions[position1]);
    Object arrayElementTmp2 = arrayOI.getListElement(array, positions[position2]);
    return ObjectInspectorUtils.compare(arrayElementTmp1, arrayElementOI, arrayElementTmp2, arrayElementOI);
}
That is, add an int[] positions parameter and pass in leftPositions or rightPositions; the call sites of this method need to be updated accordingly.
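The essence of this fix is that the comparison must go through the positions index array: compare the elements that positions[i] and positions[j] point at, not the elements at i and j directly. A standalone sketch with plain ints (names illustrative, not the UDF's actual types):

```java
public class PositionCompare {
    // Position-indirected comparison: positions[] maps logical sort slots
    // to physical element indices, so we must dereference through it.
    // Comparing array[i] to array[j] directly was the reported bug.
    static int compareViaPositions(int[] array, int[] positions, int i, int j) {
        return Integer.compare(array[positions[i]], array[positions[j]]);
    }
}
```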
array_intersect has an array index out-of-bounds bug, and the code is also complicated. Simplified code:
`@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object leftArray = arguments[0].get();
    Object rightArray = arguments[1].get();
    // Check if either array is null or empty
    if (leftArray == null || rightArray == null) {
        return null;
    }
    int leftArrayLength = leftArrayOI.getListLength(leftArray);
    int rightArrayLength = rightArrayOI.getListLength(rightArray);
    if (leftArrayLength < 0 || rightArrayLength < 0) {
        return null;
    }
    if (leftArrayLength == 0) {
        return leftArray;
    }
    if (rightArrayLength == 0) {
        return rightArray;
    }
    List<?> leftList = leftArrayOI.getList(leftArray);
    List<?> rightList = rightArrayOI.getList(rightArray);
    HashSet<?> resultSet = Sets.newHashSet(leftList);
    resultSet.retainAll(rightList);
    return new ArrayList<>(resultSet);
}`
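The core of the simplification is plain java.util set intersection, with no index bookkeeping to get out of bounds. A self-contained sketch of the same logic over Strings, outside the ObjectInspector machinery:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IntersectSketch {
    // Deduplicate the left array into a set, then keep only elements also
    // present on the right -- the HashSet + retainAll core of evaluate().
    // LinkedHashSet preserves the left array's encounter order.
    static List<String> intersect(List<String> left, List<String> right) {
        Set<String> result = new LinkedHashSet<>(left);
        result.retainAll(right);
        return new ArrayList<>(result);
    }
}
```

Under this approach the query above returns all three common elements, not just one.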
Hello,
I tried to apply these UDFs in spark with hive support. Here is the code:
// register UDF
spark.sql("create temporary function id_card_province as 'cc.shanruifeng.functions.card.UDFChinaIdCardProvince'");
// get file
Dataset<Row> rawdata = spark.read().csv("./src/main/resources/starM.csv");
// use UDF
rawdata.createOrReplaceTempView("starM");
Dataset<Row> udfModified = spark.sql("SELECT *, id_card_province(_c13) FROM starM");
udfModified.show();
and I got error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text cc.shanruifeng.functions.card.UDFChinaIdCardProvince.evaluate(org.apache.hadoop.io.Text) on object cc.shanruifeng.functions.card.UDFChinaIdCardProvince@5c622859 of class cc.shanruifeng.functions.card.UDFChinaIdCardProvince with arguments {652423184510291234:org.apache.hadoop.io.Text} of size 1
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:981)
at org.apache.spark.sql.hive.HiveSimpleUDF.eval(hiveUDFs.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_6$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:957)
... 18 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.Text.encode(Text.java:450)
at org.apache.hadoop.io.Text.set(Text.java:198)
at cc.shanruifeng.functions.card.UDFChinaIdCardProvince.evaluate(UDFChinaIdCardProvince.java:23)
... 23 more
Could you please give me some advice on this problem? The column is not null anyway.
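The NullPointerException at Text.encode suggests a null reached Text.set(...) inside evaluate, which can happen even when the source column looks non-null, for example when the UDF's internal province lookup fails for a given value and the result being wrapped is null. A hedged sketch of the usual defensive pattern, written with plain Strings since the real method takes org.apache.hadoop.io.Text (the two-digit-prefix lookup and map name are assumptions for illustration):

```java
import java.util.Map;

public class NullGuardSketch {
    // Defensive evaluate pattern: return null instead of passing null to
    // Text.set(), which throws the NPE seen in the stack trace.
    static String evaluateSafely(String idCard, Map<String, String> provinceByPrefix) {
        if (idCard == null || idCard.length() < 2) {
            return null;
        }
        // Hypothetical lookup keyed on the card's first two digits; an
        // unknown prefix must also yield null rather than an exception.
        return provinceByPrefix.get(idCard.substring(0, 2));
    }
}
```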