
datafu's People

Contributors

cw11oyd, matthayes, mtiwari, navteniev, samshah, talevy, william-g-vaughan

datafu's Issues

Variance Issue?

Hi Matt,

I was using exactly the same example as in the README to compute the variance, except that I cannot use input as my variable name (otherwise I get a "mismatched input 'input' expecting EOF" error), but I got 6.66666... as my result. Could you help explain where I went wrong? Appreciate it!

The version I'm using is:
Apache Pig version 0.10.1.4.1304150518 (rexported)

And here's the code I used:

register datafu-0.0.8.jar;

define VAR datafu.pig.stats.VAR();

-- input: 1,2,3,4,5,6,7,8,9
a = LOAD 'input' AS (val:int);

grouped = GROUP a ALL;
-- produces variance of 7.5
variance = FOREACH grouped GENERATE VAR(a.val);

dump variance;
-- (6.666666666666668)
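
For reference, the gap between the README's 7.5 and the observed 6.666... matches the difference between the sample variance (dividing by N - 1) and the population variance (dividing by N) for the values 1 through 9. A minimal Java check of both definitions (an illustration, not DataFu code):

```java
import java.util.stream.IntStream;

public class VarianceCheck {
    public static void main(String[] args) {
        int[] vals = IntStream.rangeClosed(1, 9).toArray();
        double mean = IntStream.of(vals).average().getAsDouble();  // 5.0
        double sumSq = IntStream.of(vals)
                .mapToDouble(v -> (v - mean) * (v - mean))
                .sum();                                            // 60.0
        System.out.println("population variance (N):     " + sumSq / vals.length);        // 6.666...
        System.out.println("sample variance     (N - 1): " + sumSq / (vals.length - 1));  // 7.5
    }
}
```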

Hourglass fixed-length windows should be robust to reappearing data

For a fixed-length window where the output is being reused, the oldest day is "subtracted" from the previous output as the window advances. However, it's possible that a day is missing when the output is computed. If that data reappears later, it will eventually be subtracted off even though it was never included in the output, which could yield "negative" values.

Hourglass doesn't currently track which intervals are included in the output, only the start and end time. This could be solved by recording the interval coverage as well; it would also mean that when the data reappears it can be included in the next output.
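
A minimal sketch of the interval-coverage idea (hypothetical names, not the actual Hourglass API): record exactly which days the current output covers, and only subtract a day that was genuinely part of the previous result.

```java
import java.time.LocalDate;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a fixed-length window that tracks which days its
// current output actually covers, rather than just a start and end time.
class CoverageTrackingWindow {
    private final Set<LocalDate> coveredDays = new TreeSet<>();
    private double total; // stands in for whatever value the job aggregates

    void add(LocalDate day, double value) {
        if (coveredDays.add(day)) {
            total += value;
        }
    }

    // Only subtract a day that was included in the previous output; a day that
    // was missing when the output was computed is ignored, which avoids the
    // "negative value" problem described above.
    void subtract(LocalDate day, double value) {
        if (coveredDays.remove(day)) {
            total -= value;
        }
    }

    boolean covers(LocalDate day) {
        return coveredDays.contains(day);
    }

    double total() {
        return total;
    }
}
```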

possible error in getting started with hourglass documentation

Hi,

Can someone please tell me how to get started with Hourglass using the sample problem mentioned in the guide - http://datafu.incubator.apache.org/docs/hourglass/getting-started.html

It says there -

Your other option is to download the code and build the JAR yourself. After unzipping the archive, navigate to contrib/hourglass and build the JAR by running ant jar. The dependencies will be downloaded to lib/common.

But I don't see a contrib folder in the datafu-master.zip directory, and I also don't see a contrib folder after cloning git://git.apache.org/incubator-datafu.git.

Could someone please help me get started with Hourglass? Much appreciated.

I also couldn't find any build.xml or targets anywhere, so I'm not sure where I can run ant jar.

If there is another guide please point me to it.

Thanks

MarkovChain doesn't seem to work properly

Session Input Data:

(1970-01-16T09:32:51.178Z,USER3,/A,61c20dfb-c17f-44bf-861e-683ec8431dba)
(1970-01-16T09:32:58.200Z,USER3,/B,61c20dfb-c17f-44bf-861e-683ec8431dba)
(1970-01-16T09:33:01.300Z,USER3,/C,61c20dfb-c17f-44bf-861e-683ec8431dba)
(1970-01-16T09:33:03.400Z,USER3,/D,61c20dfb-c17f-44bf-861e-683ec8431dba)

Output Data:
({((/D),(/A)),((/A),(/B)),((/B),(/C))})

The first pair should be reversed.
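
For comparison, a minimal sketch (plain Java, not the UDF's code) of how consecutive page-view pairs would normally be formed from a time-ordered session; on the session above it yields (/A,/B), (/B,/C), (/C,/D), with no pair starting at /D:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConsecutivePairs {
    // Pair each page with the page viewed immediately after it.
    static List<String[]> pairs(List<String> orderedPages) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i + 1 < orderedPages.size(); i++) {
            result.add(new String[] { orderedPages.get(i), orderedPages.get(i + 1) });
        }
        return result;
    }

    public static void main(String[] args) {
        // The session above, ordered by timestamp.
        List<String> session = Arrays.asList("/A", "/B", "/C", "/D");
        for (String[] p : pairs(session)) {
            System.out.println("(" + p[0] + "," + p[1] + ")");  // (/A,/B) (/B,/C) (/C,/D)
        }
    }
}
```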

Quantile and StreamingQuantile don't work - 'can't work with argument null'

file ./topics.pig, line 31, column 62> Failed to generate logical plan. Nested exception: java.lang.RuntimeException: could not instantiate 'datafu.pig.stats.Quantile' with arguments 'null'

When:

quantiles = foreach (group token_counts all) generate FLATTEN(datafu.pig.stats.Quantile('0.10', '0.90')) as (low_ten, high_ten);

Release : When can we expect the 1.3 release of DataFu ?

Thanks for the fix to the SampleByKey issue. Please let us know when we can expect the release that contains this fix.

Or, if the build instructions are documented somewhere, I can apply the appropriate patch and make a temporary build until the official JAR is out.

Thanks
Bala

datafu.pig.sampling.ReservoirSample

Hi, when using ReservoirSample it seems like the sample is taken over the full input instead of over each group's input.

e.g. Let's say my input.txt is
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7

I have the following program:
DEFINE SRS datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE FLATTEN(SRS(data));

The output was as follows. I would have expected it to sample within each group (meaning I would see 2 samples of a1 and 2 samples of a2); why is that not the case?
(a1,7)
(a1,6)
(a1,6)
(a1,6)
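
For what it's worth, the behavior I expected (one reservoir kept per group) can be sketched like this; it is a plain Java illustration of the classic reservoir algorithm, not DataFu's implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PerGroupReservoir {
    public static void main(String[] args) {
        String[][] rows = {
            {"a1", "5"}, {"a1", "6"}, {"a1", "7"},
            {"a2", "5"}, {"a2", "6"}, {"a2", "7"},
        };
        int sampleSize = 2;
        Random rng = new Random();

        // One reservoir per key, so each group is sampled independently.
        Map<String, List<String[]>> reservoirs = new HashMap<>();
        Map<String, Integer> seen = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0];
            List<String[]> res = reservoirs.computeIfAbsent(key, k -> new ArrayList<>());
            int n = seen.merge(key, 1, Integer::sum); // items seen so far for this key
            if (res.size() < sampleSize) {
                res.add(row);
            } else {
                int j = rng.nextInt(n);               // replace slot j if it falls inside the reservoir
                if (j < sampleSize) {
                    res.set(j, row);
                }
            }
        }
        // Expected shape of the result: exactly 2 rows for a1 and 2 rows for a2.
        reservoirs.forEach((k, v) -> v.forEach(r -> System.out.println("(" + r[0] + "," + r[1] + ")")));
    }
}
```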

Getting java.lang.NullPointerException in running PageRank

I have 35M link pairs from 117K nodes and ran a PageRank job on a 3-node m2.2xlarge EMR cluster. Initially I got an out-of-memory error in the reduce phase, so I increased the JVM size, and now I am getting the following error (this happens in one reduce task while the other 3 complete without any error):

2015-01-04 03:44:40,349 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: topic_ranks: New For Each(false,false,false)[bag] - scope-42 Operator Key: scope-42): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(datafu.pig.linkanalysis.PageRank)[bag] - scope-33 Operator Key: scope-33) children: null at []]: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Any idea how to resolve this issue? I used Hadoop 2.4.0 and Pig 0.12.0.

Fix jar-instrumented task on build.xml

The jar-instrumented task in build.xml isn't working because it references a nonexistent jar.
The zipfileset src should be fixed to ${name}-${version}.jar instead of ${name}-${datafu.version}.jar:

...

-    <jarjar jarfile="${instrumented.dir}/${name}-${datafu.version}.jar">      
-      <zipfileset src="${dist.dir}/${name}-${datafu.version}.jar"/>        
+    <jarjar jarfile="${instrumented.dir}/${name}-${datafu.version}.jar">
+      <zipfileset src="${dist.dir}/${name}-${version}.jar"/>

...

Sorry I didn't submit a PR, I don't have the repo cloned.

Implement and experiment with different weighted sampling algorithms

This is the reference paper I used to learn about the weighted sampling algorithms: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

The present WeightedSample.java implements Algorithm D.

We may try Algorithm A, A-Res, and A-ExpJ, since they can be used in data-stream and distributed environments. These algorithms could be implemented based on ReservoirSample.java (perhaps by inheriting from that class?) since they also need a reservoir to store the selected items.
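
As a point of reference, here is a compact sketch of A-Res from the Efraimidis/Spirakis paper: assign each item the key u^(1/w) with u uniform in (0,1) and keep the k items with the largest keys. This is only an illustration in plain Java, not a proposed DataFu implementation:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

public class WeightedReservoirARes {
    static class Item {
        final String value;
        final double key;   // u^(1/w); larger keys win
        Item(String value, double key) { this.value = value; this.key = key; }
    }

    // Keep the k items with the largest keys using a min-heap of size k.
    static PriorityQueue<Item> sample(String[] values, double[] weights, int k, Random rng) {
        PriorityQueue<Item> reservoir =
                new PriorityQueue<>(Comparator.comparingDouble((Item i) -> i.key));
        for (int i = 0; i < values.length; i++) {
            double key = Math.pow(rng.nextDouble(), 1.0 / weights[i]);
            if (reservoir.size() < k) {
                reservoir.add(new Item(values[i], key));
            } else if (key > reservoir.peek().key) {
                reservoir.poll();                       // evict the smallest key
                reservoir.add(new Item(values[i], key));
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        String[] values = {"a", "b", "c", "d"};
        double[] weights = {1.0, 2.0, 5.0, 0.5};        // heavier items are more likely to be kept
        sample(values, weights, 2, new Random()).forEach(item -> System.out.println(item.value));
    }
}
```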

Quantile and StreamingQuantile use inconsistent names for schemas

This may be kind of minor, but I added an EXPLAIN to quantile5aTest and streamingQuantile5aTest. The generated names in the schema are different.

StreamingQuantile('10') yields:

data_out: (Name: LOStore Schema: quantiles#50:tuple(quantile_0#51:double,quantile_1#52:double,quantile_2#53:double,quantile_3#54:double,quantile_4#55:double,quantile_5#56:double,quantile_6#57:double,quantile_7#58:double,quantile_8#59:double,quantile_9#60:double))

Quantile('10') yields:

data_out: (Name: LOStore Schema: quantiles#42:tuple(quantile_0_0#43:double,quantile_0_1111111111111111#44:double,quantile_0_2222222222222222#45:double,quantile_0_3333333333333333#46:double,quantile_0_4444444444444444#47:double,quantile_0_5555555555555556#48:double,quantile_0_6666666666666666#49:double,quantile_0_7777777777777778#50:double,quantile_0_8888888888888888#51:double,quantile_1_0#52:double))

Does it work with org.apache.pig.builtin.MonitoredUDF?

Hi, If I use

@MonitoredUDF(timeUnit = TimeUnit.MINUTES, duration = 10, errorCallback = NplRecMatcherErrorCallback.class)
class NplRecFirstLevelMatcher extends AliasableEvalFunc<Tuple> implements DebuggableUDF{
//some cool stuff goes here!
}

I get this exception:

14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Class: class NplRecFirstLevelMatcher
14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Instance name: 30
14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Properties: {30={}}
*** ***A debug output from my handler method***  ***
NplRecMatcherErrorCallback.handleError

null
ERROR: java.lang.RuntimeException: Could not retrieve aliases from properties using aliasMap
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not retrieve aliases from properties using aliasMap
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
    at java.util.concurrent.FutureTask.get(FutureTask.java:91)
    at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:69)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor.monitorExec(MonitoredUDFExecutor.java:183)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:335)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:376)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:354)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:241)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:449)
Caused by: java.lang.RuntimeException: Could not retrieve aliases from properties using aliasMap
    at datafu.pig.util.AliasableEvalFunc.getFieldAliases(AliasableEvalFunc.java:164)
    at datafu.pig.util.AliasableEvalFunc.getPosition(AliasableEvalFunc.java:171)
    at datafu.pig.util.AliasableEvalFunc.getBag(AliasableEvalFunc.java:253)
    at datafu.pig.util.AliasableEvalFunc$getBag.callCurrent(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:49)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
    at NplRecFirstLevelMatcher.exec(NplRecFirstLevelMatcher.groovy:53)
    at NplRecFirstLevelMatcher.exec(NplRecFirstLevelMatcher.groovy)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor$1.apply(MonitoredUDFExecutor.java:95)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor$1.apply(MonitoredUDFExecutor.java:91)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor$2.call(MonitoredUDFExecutor.java:164)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

The exception happens on this line:

itemsBag = getBag(input, ORDERED)

If I remove the @MonitoredUDF annotation, it works fine and the tests pass.

Hourglass cannot handle joins where schemas are inconsistent

When joining two paths, it's possible that there may be a type that appears in both schemas (i.e. same namespace and name). It's also possible that the schemas for this type may not match. It's not uncommon to share types across data sets. Backwards compatible changes are sometimes made to these shared types, such as adding a nullable field. Old data in this case can still be parsed with the new schema. But one path may be behind the other. This causes a problem with Hourglass because it constructs a union of the input schemas, as required by Avro since an input schema must be set for the job. Actually this should be a problem with any MapReduce job joining Avro data like this.

I see a couple potential solutions to this:

  1. Reconcile all the input schemas so their types are consistent. For example, suppose Schema A and Schema B both have a field of record type X. In Schema A, type X has field 1. But in Schema B, type X has field 1 and field 2. Schema A could be updated so that B's version of type X is used (a rough sketch of this approach follows the list).

  2. Enhance Avro's Hadoop related code to support multiple inputs. Currently, as far as I can tell, Avro does not support specifying different input schemas for different mapper classes. The same schema is shared for all. AvroKeyInputFormat could be altered so that it uses one of many schemas stored in the conf, determining the correct one based on which input is to be read. Then multiple inputs could be used.
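
A rough sketch of option 1, under the assumption that the "newer" version of a shared record type is simply the one with more fields (hypothetical helper, not Hourglass code). It walks two schemas and, for each record type that appears in both, picks the version to prefer:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class SharedTypeResolver {
    // Collect every named record type reachable from a schema, keyed by full name.
    static void collectRecords(Schema schema, Map<String, Schema> out) {
        switch (schema.getType()) {
            case RECORD:
                if (out.putIfAbsent(schema.getFullName(), schema) == null) {
                    for (Field f : schema.getFields()) {
                        collectRecords(f.schema(), out);
                    }
                }
                break;
            case ARRAY:
                collectRecords(schema.getElementType(), out);
                break;
            case MAP:
                collectRecords(schema.getValueType(), out);
                break;
            case UNION:
                for (Schema branch : schema.getTypes()) {
                    collectRecords(branch, out);
                }
                break;
            default:
                break;
        }
    }

    // For record types that appear in both inputs, prefer the version with more
    // fields (e.g. the one that gained a nullable field), assuming additive changes.
    static Map<String, Schema> reconcile(Schema a, Schema b) {
        Map<String, Schema> fromA = new HashMap<>();
        Map<String, Schema> fromB = new HashMap<>();
        collectRecords(a, fromA);
        collectRecords(b, fromB);

        Map<String, Schema> preferred = new HashMap<>(fromA);
        fromB.forEach((name, schema) -> preferred.merge(name, schema,
                (x, y) -> x.getFields().size() >= y.getFields().size() ? x : y));
        return preferred;
    }

    public static void main(String[] args) {
        Schema a = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"A\",\"fields\":[{\"name\":\"x\",\"type\":"
            + "{\"type\":\"record\",\"name\":\"X\",\"fields\":[{\"name\":\"f1\",\"type\":\"int\"}]}}]}");
        Schema b = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"B\",\"fields\":[{\"name\":\"x\",\"type\":"
            + "{\"type\":\"record\",\"name\":\"X\",\"fields\":[{\"name\":\"f1\",\"type\":\"int\"},"
            + "{\"name\":\"f2\",\"type\":[\"null\",\"int\"],\"default\":null}]}}]}");
        // X appears in both inputs; the version from B (with the extra nullable field) wins.
        System.out.println(reconcile(a, b).get("X"));
    }
}
```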

QuantileUtil.getQuantilesFromParams broken

QuantileUtil.getQuantilesFromParams("10") gives:

[0.0, 0.1111111111111111, 0.2222222222222222, 0.3333333333333333, 0.4444444444444444, 0.5555555555555556, 0.6666666666666667, 0.7777777777777779, 0.8888888888888891]

I discovered this when using the Quantile UDF with a single constructor argument and noticing that the last computed percentile did not match the max value in my dataset.

QuantileUtil.getQuantilesFromParams("0", ".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", "1") correctly gives:

[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
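
For comparison, a minimal sketch of what a single numeric argument would presumably be expected to produce: n evenly spaced quantiles spanning 0.0 to 1.0 inclusive, i.e. i / (n - 1). For n = 10 that ends at 1.0 instead of stopping at 0.888...:

```java
public class ExpectedQuantiles {
    // n evenly spaced quantiles spanning [0, 1] inclusive.
    static double[] quantiles(int n) {
        double[] q = new double[n];
        for (int i = 0; i < n; i++) {
            q[i] = (double) i / (n - 1);
        }
        return q;
    }

    public static void main(String[] args) {
        // Prints 0.0, 0.111..., ..., 0.888..., 1.0 -- note the final 1.0,
        // which is what lets the last percentile match the max of the data.
        for (double q : quantiles(10)) {
            System.out.println(q);
        }
    }
}
```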

Pete

Implement Entropy UDF

In the real world, there are occasions when we need to calculate the entropy of discrete random variables, for instance to calculate the mutual information between variables X and Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). I would suggest implementing a UDF that calculates the entropy of the given input samples, following the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

In Apache Mahout, there is an existing implementation of entropy using Map/Reduce API, please refer to https://issues.apache.org/jira/browse/MAHOUT-747.

UDF Spec:

package datafu.pig.stats;

public class Entropy extends AccumulatorEvalFunc

The constructor accepts a string indicating the logarithm base: 2, e (Euler's number), or 10

Input: a bag of ordered tuples X = {x[0], x[1], ... x[N - 1]}

Output: entropy value of double type

How to use:

X = LOAD 'input' AS (val:int);
EX = FOREACH (GROUP X ALL) {OX = ORDER X by val; GENERATE Entropy(OX);}
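
A minimal sketch of the computation the proposed UDF would perform, assuming the input bag is sorted so that equal values are adjacent (plain Java, not a Pig UDF):

```java
public class ShannonEntropy {
    // Entropy of a sorted sequence of samples: H = -sum over x of p(x) * log_b p(x),
    // where p(x) is the empirical frequency of each distinct value.
    static double entropy(int[] sortedSamples, double logBase) {
        int n = sortedSamples.length;
        double h = 0.0;
        int i = 0;
        while (i < n) {
            int j = i;
            while (j < n && sortedSamples[j] == sortedSamples[i]) {
                j++;                         // count the run of equal values
            }
            double p = (double) (j - i) / n; // empirical probability of this value
            h -= p * Math.log(p) / Math.log(logBase);
            i = j;
        }
        return h;
    }

    public static void main(String[] args) {
        // Two equally likely values -> exactly 1 bit of entropy with base 2.
        System.out.println(entropy(new int[] {1, 1, 2, 2}, 2.0)); // 1.0
    }
}
```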

Add datafu.pig.text package

Port OpenNLP libraries to the datafu.pig.text package, offering: Tokenize, SentenceTokenize, POSTag and Chunk UDFs

AliasableEvalFunc subclass, how to test it?

Hi, I've written a UDF in Groovy using AliasableEvalFunc.

I want to test it at the unit level; I don't want to wrap the class with a Pig script.
How can I push down a tuple schema to an AliasableEvalFunc subclass?
I can't find an example in the tests.

I've written this ugly code, and it looks like it works:
```groovy
ZonesGenerator initUdf() {
    // Build the alias -> position map that AliasableEvalFunc would normally
    // derive from the input schema, then override getFieldAliases to return it.
    def getAliasSchemaMap = {
        def aliasSchemaAsStr = 'bag1.field1=0, bag2::moreName.field1=1'
        aliasSchemaAsStr.split(',').inject([:]) { memo, keyVal ->
            memo[keyVal.split('=').first().trim()] = Integer.valueOf(keyVal.split('=').last().trim())
            return memo
        }
    }

    new ZonesGenerator() {
        public Map<String, Integer> getFieldAliases() {
            return getAliasSchemaMap()
        }
    }
}
```

DistinctBy output is incorrect when any value in the tuple contains a '-'

This is the result of using '-' as a delimiter, which cannot be overridden. The delimiter is used to convert the tuple values into a string and then to tokenize the generated string.

public class DistinctBy extends AccumulatorEvalFunc<DataBag>
{
  private final static String delimiter = "-";  
...
private String getDistinctString(Tuple t, HashSet<Integer> distinctFieldPositions) throws ExecException {
    String[] tokens = t.toDelimitedString(delimiter).split(delimiter);
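
A small Java illustration of the failure mode (hypothetical values, mimicking the join-then-split approach of the snippet above): two different tuples collapse to the same token sequence once a field itself contains '-':

```java
import java.util.Arrays;

public class DelimiterCollision {
    public static void main(String[] args) {
        String delimiter = "-";

        // Two logically different tuples...
        String[] tupleA = {"2014-01-15", "click"};
        String[] tupleB = {"2014", "01-15-click"};

        // ...become indistinguishable after joining on '-' and splitting again,
        // which is essentially what getDistinctString does above.
        String a = String.join(delimiter, tupleA);
        String b = String.join(delimiter, tupleB);

        System.out.println(a.equals(b));                          // true
        System.out.println(Arrays.toString(a.split(delimiter)));  // [2014, 01, 15, click]
        System.out.println(Arrays.toString(b.split(delimiter)));  // [2014, 01, 15, click]
    }
}
```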

Merge BagConcat and BagUnion

Let's name the final version BagConcat. I think people may assume that BagUnion performs a set union and be confused. With BagConcat there is less chance for confusion.

Hash UDFs should return zero-padded strings of uniform length even when leading bits are zero.

The Hash UDFs in 'hex' mode do not always return strings of the same length, because BigInteger.toString() omits leading zeros. So in a stream where roughly 94% of the strings have the full length, 1/16 are shorter by one or more characters, 1/256 by two or more, and in the unlikely case that an MD5 hash's value were 124 bits of zeros followed by 4 bits of ones, it would return the one-character string 'f'.

This is surprising behavior, and a trap for those practicing the frequent trick of generating a hash and chopping off just the number of bits you need:

-- returns one-fifteenth, not one-sixteenth, of the input.
sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));

mrflip@5c4a77c makes the UDF return a string zero-padded to (length of hash / 4) characters. It needs a lookup table to know how to format a SHA hash; all of the potential SHA-prefixed algorithms in Java 7 are covered.
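
A minimal sketch of the padding idea (not the patch itself): derive the digest length in hex characters from the algorithm's output size and left-pad with zeros:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PaddedHexHash {
    // Hex-encode a digest, zero-padded to (length of hash in bits / 4) characters,
    // so the output length is constant even when the leading bits are zero.
    static String paddedHex(String algorithm, String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        int hexChars = digest.length * 2;  // two hex characters per byte
        return String.format("%0" + hexChars + "x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Always 32 characters for MD5, even when the hash starts with zero nibbles.
        System.out.println(paddedHex("MD5", "hello"));
    }
}
```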
