
datafu's People

Contributors

cw11oyd, matthayes, mtiwari, navteniev, samshah, talevy, william-g-vaughan

datafu's Issues

Variance Issue?

Hi Matt,

I was using exactly the same example as in the README to compute the variance, except that I cannot use input as my variable name (otherwise I get a "mismatched input 'input' expecting EOF" error), but I got 6.66666... as my result. Could you help explain where I went wrong? Appreciate it!

The version I'm using is:
Apache Pig version 0.10.1.4.1304150518 (rexported)

And here's the code I used:

register datafu-0.0.8.jar;

define VAR datafu.pig.stats.VAR();

-- input: 1,2,3,4,5,6,7,8,9
a = LOAD 'input' AS (val:int);

grouped = GROUP a ALL;
-- produces variance of 7.5
variance = FOREACH grouped GENERATE VAR(a.val);

dump variance;
-- (6.666666666666668)
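
For reference, the gap between the README's 7.5 and the observed 6.666... matches the difference between the sample variance (dividing by N - 1) and the population variance (dividing by N) for the values 1 through 9. A minimal Java check of both definitions (an illustration, not DataFu code):

```java
import java.util.stream.IntStream;

public class VarianceCheck {
    public static void main(String[] args) {
        int[] vals = IntStream.rangeClosed(1, 9).toArray();
        double mean = IntStream.of(vals).average().getAsDouble();  // 5.0
        double sumSq = IntStream.of(vals)
                .mapToDouble(v -> (v - mean) * (v - mean))
                .sum();                                            // 60.0
        System.out.println("population variance (N):     " + sumSq / vals.length);        // 6.666...
        System.out.println("sample variance     (N - 1): " + sumSq / (vals.length - 1));  // 7.5
    }
}
```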

Hourglass fixed-length windows should be robust to reappearing data

For a fixed-length window where the output is being reused, the oldest day is "subtracted" from the previous output as the window advances. However, it's possible that a day is missing when the output is computed. If that data reappears later, it will eventually be subtracted off even though it was never included in the output, which could yield "negative" values.

Hourglass doesn't currently track which intervals are included in the output, only the start and end time. This could be solved by recording the interval coverage as well; it would also mean that when the data reappears it can be included in the next output.
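
A minimal sketch of the interval-coverage idea (hypothetical names, not the actual Hourglass API): record exactly which days the current output covers, and only subtract a day that was genuinely part of the previous result.

```java
import java.time.LocalDate;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a fixed-length window that tracks which days its
// current output actually covers, rather than just a start and end time.
class CoverageTrackingWindow {
    private final Set<LocalDate> coveredDays = new TreeSet<>();
    private double total; // stands in for whatever value the job aggregates

    void add(LocalDate day, double value) {
        if (coveredDays.add(day)) {
            total += value;
        }
    }

    // Only subtract a day that was included in the previous output; a day that
    // was missing when the output was computed is ignored, which avoids the
    // "negative value" problem described above.
    void subtract(LocalDate day, double value) {
        if (coveredDays.remove(day)) {
            total -= value;
        }
    }

    boolean covers(LocalDate day) {
        return coveredDays.contains(day);
    }

    double total() {
        return total;
    }
}
```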

possible error in getting started with hourglass documentation

Hi,

Can someone please tell me how to get started with Hourglass using the sample problem mentioned in the guide - http://datafu.incubator.apache.org/docs/hourglass/getting-started.html

It says there -

Your other option is to download the code and build the JAR yourself. After unzipping the archive, navigate to contrib/hourglass and build the JAR by running ant jar. The dependencies will be downloaded to lib/common.

But I don't see a contrib folder in the datafu-master.zip directory, and I also don't see a contrib folder after cloning git://git.apache.org/incubator-datafu.git.

Could someone please help me get started with Hourglass? Much appreciated.

I also couldn't find any build.xml or targets anywhere, so I'm not sure where I can run ant jar.

If there is another guide please point me to it.

Thanks

MarkovChain doesn't seem to work properly

Session Input Data:

(1970-01-16T09:32:51.178Z,USER3,/A,61c20dfb-c17f-44bf-861e-683ec8431dba)
(1970-01-16T09:32:58.200Z,USER3,/B,61c20dfb-c17f-44bf-861e-683ec8431dba)
(1970-01-16T09:33:01.300Z,USER3,/C,61c20dfb-c17f-44bf-861e-683ec8431dba)
(1970-01-16T09:33:03.400Z,USER3,/D,61c20dfb-c17f-44bf-861e-683ec8431dba)

Output Data:
({((/D),(/A)),((/A),(/B)),((/B),(/C))})

The first pair should be reversed.
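
For comparison, a minimal sketch (plain Java, not the UDF's code) of how consecutive page-view pairs would normally be formed from a time-ordered session; on the session above it yields (/A,/B), (/B,/C), (/C,/D), with no pair starting at /D:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConsecutivePairs {
    // Pair each page with the page viewed immediately after it.
    static List<String[]> pairs(List<String> orderedPages) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i + 1 < orderedPages.size(); i++) {
            result.add(new String[] { orderedPages.get(i), orderedPages.get(i + 1) });
        }
        return result;
    }

    public static void main(String[] args) {
        // The session above, ordered by timestamp.
        List<String> session = Arrays.asList("/A", "/B", "/C", "/D");
        for (String[] p : pairs(session)) {
            System.out.println("(" + p[0] + "," + p[1] + ")");  // (/A,/B) (/B,/C) (/C,/D)
        }
    }
}
```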

Quantile and StreamingQuantile don't work - 'can't work with argument null'

file ./topics.pig, line 31, column 62> Failed to generate logical plan. Nested exception: java.lang.RuntimeException: could not instantiate 'datafu.pig.stats.Quantile' with arguments 'null'

When:

quantiles = foreach (group token_counts all) generate FLATTEN(datafu.pig.stats.Quantile('0.10', '0.90')) as (low_ten, high_ten);

Release : When can we expect the 1.3 release of DataFu ?

Thanks for the fix to the SampleByKey issue. Please let us know when we can expect the release that contains this fix.

Or, if the build instructions are documented somewhere, I can apply the appropriate patch and make a temporary build until the official JAR is out.

Thanks
Bala

datafu.pig.sampling.ReservoirSample

Hi, when using ReservoirSample it seems like the sample is taken over the full input instead of over each group's input.

e.g. Let's say my input.txt is
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7

I have the following program:
DEFINE SRS datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE FLATTEN(SRS(data));

The output was as follows. I would have expected it to sample within each group (meaning I would see 2 samples of a1 and 2 samples of a2); why is that not the case?
(a1,7)
(a1,6)
(a1,6)
(a1,6)
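
For what it's worth, the behavior I expected (one reservoir kept per group) can be sketched like this; it is a plain Java illustration of the classic reservoir algorithm, not DataFu's implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PerGroupReservoir {
    public static void main(String[] args) {
        String[][] rows = {
            {"a1", "5"}, {"a1", "6"}, {"a1", "7"},
            {"a2", "5"}, {"a2", "6"}, {"a2", "7"},
        };
        int sampleSize = 2;
        Random rng = new Random();

        // One reservoir per key, so each group is sampled independently.
        Map<String, List<String[]>> reservoirs = new HashMap<>();
        Map<String, Integer> seen = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0];
            List<String[]> res = reservoirs.computeIfAbsent(key, k -> new ArrayList<>());
            int n = seen.merge(key, 1, Integer::sum); // items seen so far for this key
            if (res.size() < sampleSize) {
                res.add(row);
            } else {
                int j = rng.nextInt(n);               // replace slot j if it falls inside the reservoir
                if (j < sampleSize) {
                    res.set(j, row);
                }
            }
        }
        // Expected shape of the result: exactly 2 rows for a1 and 2 rows for a2.
        reservoirs.forEach((k, v) -> v.forEach(r -> System.out.println("(" + r[0] + "," + r[1] + ")")));
    }
}
```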

Getting java.lang.NullPointerException in running PageRank

I have 35M link pairs from 117K nodes and ran a PageRank job on a 3-node m2.2xlarge EMR cluster. Initially I got an out-of-memory error in the reduce phase, so I increased the JVM size, and now I am getting the following error (this happens in one reduce task while the other 3 complete without any error):

2015-01-04 03:44:40,349 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: topic_ranks: New For Each(false,false,false)[bag] - scope-42 Operator Key: scope-42): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(datafu.pig.linkanalysis.PageRank)[bag] - scope-33 Operator Key: scope-33) children: null at []]: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

Any idea how to resolve this issue? I used Hadoop 2.4.0 and Pig 0.12.0.

Fix jar-instrumented task on build.xml

The jar-instrumented task in build.xml isn't working because it references a nonexistent jar.
The zipfileset src should be fixed to ${name}-${version}.jar instead of ${name}-${datafu.version}.jar:

...

-    <jarjar jarfile="${instrumented.dir}/${name}-${datafu.version}.jar">      
-      <zipfileset src="${dist.dir}/${name}-${datafu.version}.jar"/>        
+    <jarjar jarfile="${instrumented.dir}/${name}-${datafu.version}.jar">
+      <zipfileset src="${dist.dir}/${name}-${version}.jar"/>

...

Sorry I didn't submit a PR, I don't have the repo cloned.

Implement and experiment with different weighted sampling algorithms

This is the reference paper I used to learn about the weighted sampling algorithms: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf

The present WeightedSample.java implements Algorithm D.

We may try Algorithm A, A-Res, and A-ExpJ, since they can be used in data-stream and distributed environments. These algorithms could be implemented based on ReservoirSample.java (perhaps by inheriting from that class?) since they also need a reservoir to store the selected items.
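
As a point of reference, here is a compact sketch of A-Res from the Efraimidis/Spirakis paper: assign each item the key u^(1/w) with u uniform in (0,1) and keep the k items with the largest keys. This is only an illustration in plain Java, not a proposed DataFu implementation:

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

public class WeightedReservoirARes {
    static class Item {
        final String value;
        final double key;   // u^(1/w); larger keys win
        Item(String value, double key) { this.value = value; this.key = key; }
    }

    // Keep the k items with the largest keys using a min-heap of size k.
    static PriorityQueue<Item> sample(String[] values, double[] weights, int k, Random rng) {
        PriorityQueue<Item> reservoir =
                new PriorityQueue<>(Comparator.comparingDouble((Item i) -> i.key));
        for (int i = 0; i < values.length; i++) {
            double key = Math.pow(rng.nextDouble(), 1.0 / weights[i]);
            if (reservoir.size() < k) {
                reservoir.add(new Item(values[i], key));
            } else if (key > reservoir.peek().key) {
                reservoir.poll();                       // evict the smallest key
                reservoir.add(new Item(values[i], key));
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        String[] values = {"a", "b", "c", "d"};
        double[] weights = {1.0, 2.0, 5.0, 0.5};        // heavier items are more likely to be kept
        sample(values, weights, 2, new Random()).forEach(item -> System.out.println(item.value));
    }
}
```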

Quantile and StreamingQuantile use inconsistent names for schemas

This may be kind of minor, but I added an EXPLAIN to quantile5aTest and streamingQuantile5aTest. The generated names in the schema are different.

StreamingQuantile('10') yields:

data_out: (Name: LOStore Schema: quantiles#50:tuple(quantile_0#51:double,quantile_1#52:double,quantile_2#53:double,quantile_3#54:double,quantile_4#55:double,quantile_5#56:double,quantile_6#57:double,quantile_7#58:double,quantile_8#59:double,quantile_9#60:double))

Quantile('10') yields:

data_out: (Name: LOStore Schema: quantiles#42:tuple(quantile_0_0#43:double,quantile_0_1111111111111111#44:double,quantile_0_2222222222222222#45:double,quantile_0_3333333333333333#46:double,quantile_0_4444444444444444#47:double,quantile_0_5555555555555556#48:double,quantile_0_6666666666666666#49:double,quantile_0_7777777777777778#50:double,quantile_0_8888888888888888#51:double,quantile_1_0#52:double))

Does it work with org.apache.pig.builtin.MonitoredUDF?

Hi, If I use

@MonitoredUDF(timeUnit = TimeUnit.MINUTES, duration = 10, errorCallback = NplRecMatcherErrorCallback.class)
class NplRecFirstLevelMatcher extends AliasableEvalFunc<Tuple> implements DebuggableUDF{
//some cool stuff goes here!
}

I get this exception:

14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Class: class NplRecFirstLevelMatcher
14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Instance name: 30
14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Properties: {30={}}
*** ***A debug output from my handler method***  ***
NplRecMatcherErrorCallback.handleError

null
ERROR: java.lang.RuntimeException: Could not retrieve aliases from properties using aliasMap
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not retrieve aliases from properties using aliasMap
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
    at java.util.concurrent.FutureTask.get(FutureTask.java:91)
    at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:69)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor.monitorExec(MonitoredUDFExecutor.java:183)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:335)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:376)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:354)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:241)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:449)
Caused by: java.lang.RuntimeException: Could not retrieve aliases from properties using aliasMap
    at datafu.pig.util.AliasableEvalFunc.getFieldAliases(AliasableEvalFunc.java:164)
    at datafu.pig.util.AliasableEvalFunc.getPosition(AliasableEvalFunc.java:171)
    at datafu.pig.util.AliasableEvalFunc.getBag(AliasableEvalFunc.java:253)
    at datafu.pig.util.AliasableEvalFunc$getBag.callCurrent(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:49)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
    at NplRecFirstLevelMatcher.exec(NplRecFirstLevelMatcher.groovy:53)
    at NplRecFirstLevelMatcher.exec(NplRecFirstLevelMatcher.groovy)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor$1.apply(MonitoredUDFExecutor.java:95)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor$1.apply(MonitoredUDFExecutor.java:91)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor$2.call(MonitoredUDFExecutor.java:164)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

The exception happens on this line:

itemsBag = getBag(input, ORDERED)

If I remove the @MonitoredUDF annotation, it works fine and the tests pass.

Hourglass cannot handle joins where schemas are inconsistent

When joining two paths, it's possible that there may be a type that appears in both schemas (i.e. same namespace and name). It's also possible that the schemas for this type may not match. It's not uncommon to share types across data sets. Backwards compatible changes are sometimes made to these shared types, such as adding a nullable field. Old data in this case can still be parsed with the new schema. But one path may be behind the other. This causes a problem with Hourglass because it constructs a union of the input schemas, as required by Avro since an input schema must be set for the job. Actually this should be a problem with any MapReduce job joining Avro data like this.

I see a couple potential solutions to this:

  1. Reconcile all the input schemas so their types are consistent. For example, suppose Schema A and Schema B both have a field of record type X. In Schema A, type X has field 1. But in Schema B, type X has field 1 and field 2. Schema A could be updated so that B's version of type X is used (a rough sketch of this approach follows the list).

  2. Enhance Avro's Hadoop related code to support multiple inputs. Currently, as far as I can tell, Avro does not support specifying different input schemas for different mapper classes. The same schema is shared for all. AvroKeyInputFormat could be altered so that it uses one of many schemas stored in the conf, determining the correct one based on which input is to be read. Then multiple inputs could be used.
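
A rough sketch of option 1, under the assumption that the "newer" version of a shared record type is simply the one with more fields (hypothetical helper, not Hourglass code). It walks two schemas and, for each record type that appears in both, picks the version to prefer:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class SharedTypeResolver {
    // Collect every named record type reachable from a schema, keyed by full name.
    static void collectRecords(Schema schema, Map<String, Schema> out) {
        switch (schema.getType()) {
            case RECORD:
                if (out.putIfAbsent(schema.getFullName(), schema) == null) {
                    for (Field f : schema.getFields()) {
                        collectRecords(f.schema(), out);
                    }
                }
                break;
            case ARRAY:
                collectRecords(schema.getElementType(), out);
                break;
            case MAP:
                collectRecords(schema.getValueType(), out);
                break;
            case UNION:
                for (Schema branch : schema.getTypes()) {
                    collectRecords(branch, out);
                }
                break;
            default:
                break;
        }
    }

    // For record types that appear in both inputs, prefer the version with more
    // fields (e.g. the one that gained a nullable field), assuming additive changes.
    static Map<String, Schema> reconcile(Schema a, Schema b) {
        Map<String, Schema> fromA = new HashMap<>();
        Map<String, Schema> fromB = new HashMap<>();
        collectRecords(a, fromA);
        collectRecords(b, fromB);

        Map<String, Schema> preferred = new HashMap<>(fromA);
        fromB.forEach((name, schema) -> preferred.merge(name, schema,
                (x, y) -> x.getFields().size() >= y.getFields().size() ? x : y));
        return preferred;
    }

    public static void main(String[] args) {
        Schema a = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"A\",\"fields\":[{\"name\":\"x\",\"type\":"
            + "{\"type\":\"record\",\"name\":\"X\",\"fields\":[{\"name\":\"f1\",\"type\":\"int\"}]}}]}");
        Schema b = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"B\",\"fields\":[{\"name\":\"x\",\"type\":"
            + "{\"type\":\"record\",\"name\":\"X\",\"fields\":[{\"name\":\"f1\",\"type\":\"int\"},"
            + "{\"name\":\"f2\",\"type\":[\"null\",\"int\"],\"default\":null}]}}]}");
        // X appears in both inputs; the version from B (with the extra nullable field) wins.
        System.out.println(reconcile(a, b).get("X"));
    }
}
```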

QuantileUtil.getQuantilesFromParams broken

QuantileUtil.getQuantilesFromParams("10") gives:

[0.0, 0.1111111111111111, 0.2222222222222222, 0.3333333333333333, 0.4444444444444444, 0.5555555555555556, 0.6666666666666667, 0.7777777777777779, 0.8888888888888891]

I discovered this when using the Quantile UDF with a single constructor argument and noticing that the last computed percentile did not match the max value in my dataset.

QuantileUtil.getQuantilesFromParams("0", ".1", ".2", ".3", ".4", ".5", ".6", ".7", ".8", ".9", "1") correctly gives:

[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
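
For comparison, a minimal sketch of what a single numeric argument would presumably be expected to produce: n evenly spaced quantiles spanning 0.0 to 1.0 inclusive, i.e. i / (n - 1). For n = 10 that ends at 1.0 instead of stopping at 0.888...:

```java
public class ExpectedQuantiles {
    // n evenly spaced quantiles spanning [0, 1] inclusive.
    static double[] quantiles(int n) {
        double[] q = new double[n];
        for (int i = 0; i < n; i++) {
            q[i] = (double) i / (n - 1);
        }
        return q;
    }

    public static void main(String[] args) {
        // Prints 0.0, 0.111..., ..., 0.888..., 1.0 -- note the final 1.0,
        // which is what lets the last percentile match the max of the data.
        for (double q : quantiles(10)) {
            System.out.println(q);
        }
    }
}
```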

Pete

Implement Entropy UDF

In the real world, there are occasions when we need to calculate the entropy of discrete random variables, for instance to calculate the mutual information between variables X and Y using its entropy-based formula (the mutual information calculation can be found at http://en.wikipedia.org/wiki/Mutual_information#Relation_to_other_quantities). I would suggest implementing a UDF that calculates the entropy of the given input samples, following the definition at http://en.wikipedia.org/wiki/Entropy_%28information_theory%29

In Apache Mahout, there is an existing implementation of entropy using Map/Reduce API, please refer to https://issues.apache.org/jira/browse/MAHOUT-747.

UDF Spec:

package datafu.pig.stats;

public class Entropy extends AccumulatorEvalFunc

The constructor accepts a string indicating the logarithm base: 2, e (Euler's number), or 10

Input: a bag of ordered tuples X = {x[0], x[1], ... x[N - 1]}

Output: entropy value of double type

How to use:

X = LOAD 'input' AS (val:int);
EX = FOREACH (GROUP X ALL) {OX = ORDER X by val; GENERATE Entropy(OX);}
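
A minimal sketch of the computation the proposed UDF would perform, assuming the input bag is sorted so that equal values are adjacent (plain Java, not a Pig UDF):

```java
public class ShannonEntropy {
    // Entropy of a sorted sequence of samples: H = -sum over x of p(x) * log_b p(x),
    // where p(x) is the empirical frequency of each distinct value.
    static double entropy(int[] sortedSamples, double logBase) {
        int n = sortedSamples.length;
        double h = 0.0;
        int i = 0;
        while (i < n) {
            int j = i;
            while (j < n && sortedSamples[j] == sortedSamples[i]) {
                j++;                         // count the run of equal values
            }
            double p = (double) (j - i) / n; // empirical probability of this value
            h -= p * Math.log(p) / Math.log(logBase);
            i = j;
        }
        return h;
    }

    public static void main(String[] args) {
        // Two equally likely values -> exactly 1 bit of entropy with base 2.
        System.out.println(entropy(new int[] {1, 1, 2, 2}, 2.0)); // 1.0
    }
}
```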

Add datafu.pig.text package

Port OpenNLP libraries to the datafu.pig.text package, offering: Tokenize, SentenceTokenize, POSTag and Chunk UDFs

AliasableEvalFunc subclass, how to test it?

Hi, I've written a UDF in Groovy using AliasableEvalFunc.

I want to test it at the unit level; I don't want to wrap the class with a Pig script.
How can I push down a tuple schema to an AliasableEvalFunc subclass?
I can't find an example in the tests.

I've written this ugly code, and it looks like it works:
```groovy
ZonesGenerator initUdf() {
    // Build the alias -> position map that AliasableEvalFunc would normally
    // derive from the input schema, then override getFieldAliases to return it.
    def getAliasSchemaMap = {
        def aliasSchemaAsStr = 'bag1.field1=0, bag2::moreName.field1=1'
        aliasSchemaAsStr.split(',').inject([:]) { memo, keyVal ->
            memo[keyVal.split('=').first().trim()] = Integer.valueOf(keyVal.split('=').last().trim())
            return memo
        }
    }

    new ZonesGenerator() {
        public Map<String, Integer> getFieldAliases() {
            return getAliasSchemaMap()
        }
    }
}
```

DistinctBy output is incorrect when any value in the tuple contains a '-'

This is the result of using '-' as a delimiter, which cannot be overridden. The delimiter is used to convert the tuple values into a string and then to tokenize the generated string.

public class DistinctBy extends AccumulatorEvalFunc<DataBag>
{
  private final static String delimiter = "-";  
...
private String getDistinctString(Tuple t, HashSet<Integer> distinctFieldPositions) throws ExecException {
    String[] tokens = t.toDelimitedString(delimiter).split(delimiter);
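
A small Java illustration of the failure mode (hypothetical values, mimicking the join-then-split approach of the snippet above): two different tuples collapse to the same token sequence once a field itself contains '-':

```java
import java.util.Arrays;

public class DelimiterCollision {
    public static void main(String[] args) {
        String delimiter = "-";

        // Two logically different tuples...
        String[] tupleA = {"2014-01-15", "click"};
        String[] tupleB = {"2014", "01-15-click"};

        // ...become indistinguishable after joining on '-' and splitting again,
        // which is essentially what getDistinctString does above.
        String a = String.join(delimiter, tupleA);
        String b = String.join(delimiter, tupleB);

        System.out.println(a.equals(b));                          // true
        System.out.println(Arrays.toString(a.split(delimiter)));  // [2014, 01, 15, click]
        System.out.println(Arrays.toString(b.split(delimiter)));  // [2014, 01, 15, click]
    }
}
```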

Merge BagConcat and BagUnion

Let's name the final version BagConcat. I think people may assume that BagUnion performs a set union and be confused. With BagConcat there is less chance for confusion.

Hash UDFs should return zero-padded strings of uniform length even when leading bits are zero.

The Hash UDFs in 'hex' mode do not always return strings of the same length, because BigInteger.toString() omits leading zeros. So in a stream where roughly 94% of the strings have the full length, 1/16 are shorter by one or more characters, 1/256 by two or more, and in the unlikely case that an MD5 hash's value were 124 bits of zeros followed by 4 bits of ones, it would return the one-character string 'f'.

This is surprising behavior, and a trap for those practicing the frequent trick of generating a hash and chopping off just the number of bits you need:

-- returns one-fifteenth, not one-sixteenth, of the input.
sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));

mrflip@5c4a77c makes the UDF return a string zero-padded to (length of hash / 4) characters. It needs a lookup table to know how to format a SHA hash; all of the potential SHA-prefixed algorithms in Java 7 are covered.
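
A minimal sketch of the padding idea (not the patch itself): derive the digest length in hex characters from the algorithm's output size and left-pad with zeros:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PaddedHexHash {
    // Hex-encode a digest, zero-padded to (length of hash in bits / 4) characters,
    // so the output length is constant even when the leading bits are zero.
    static String paddedHex(String algorithm, String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        int hexChars = digest.length * 2;  // two hex characters per byte
        return String.format("%0" + hexChars + "x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Always 32 characters for MD5, even when the hash starts with zero nibbles.
        System.out.println(paddedHex("MD5", "hello"));
    }
}
```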
