cascading / maple

All the Cascading taps you need and love.
Hi everyone,
Could you kindly provide some detail on how to use the HBase support in maple for testing jobs with the JobTest class included in Twitter's scalding?
Thank you in advance.
When using maple to import a 40GB+ Postgres database, I noticed that queries became progressively slower and the complete Hadoop job failed because of the use of OFFSET.
After changing the generated query to the following:
// HARDCODING PRIMARY KEY.....
query.append(" WHERE id >= ").append(split.getStart());
query.append(" LIMIT ").append(split.getLength());
With this change the query time no longer grows with the offset and stays roughly constant. The above is not a generic solution (e.g. your split column might not be id). Do you have suggestions for handling this situation? I'm also not sure how other JDBC databases handle OFFSET.
Has this library been used on large Postgres databases before? I would like some insight into best practices; even with the above optimization my import takes around 3 hours.
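As a sketch of how the hardcoded fix above could be generalized, here is a keyset-pagination query builder. `buildSplitQuery` and the `splitColumn` parameter are hypothetical names, not maple's actual API, and the `ORDER BY` is added here on the assumption that each split should read a deterministic range:

```java
// Hypothetical sketch: build a per-split query using keyset pagination
// (seek to the split's first key) instead of OFFSET (scan past N rows).
public class KeysetQueryBuilder {
    static String buildSplitQuery(String baseQuery, String splitColumn,
                                  long start, long length) {
        StringBuilder query = new StringBuilder(baseQuery);
        // Jump straight to the split's first key via the index
        query.append(" WHERE ").append(splitColumn).append(" >= ").append(start);
        // Deterministic ordering so LIMIT carves out a stable range
        query.append(" ORDER BY ").append(splitColumn);
        query.append(" LIMIT ").append(length);
        return query.toString();
    }

    public static void main(String[] args) {
        // prints: SELECT id, name FROM users WHERE id >= 1000 ORDER BY id LIMIT 500
        System.out.println(
            buildSplitQuery("SELECT id, name FROM users", "id", 1000, 500));
    }
}
```

This only works when the split column is indexed and roughly monotonic, which is exactly the limitation the original report raises.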
Thanks for you work on maple.
Cheers,
Jeroen
Use org.apache.hadoop.mapreduce instead of the deprecated org.apache.hadoop.mapred API.
This would support splits for databases that don't allow LIMIT and OFFSET.
Similar implementations exist in Sqoop (https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java) and Hadoop (https://github.com/apache/hadoop-mapreduce/blob/trunk/src/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java). However, due to implementation differences, some translation would be needed to port them.
I can do this work, but I wanted to see what others think. Do you have any recommendations?
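The core of the data-driven approach is just range arithmetic: take the MIN and MAX of an indexed column and divide that interval into contiguous ranges, so each split becomes a `WHERE col >= lo AND col < hi` query with no LIMIT/OFFSET. A minimal sketch of that boundary math (`splitRange` is a hypothetical helper, not Sqoop's or Hadoop's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of DataDrivenDBInputFormat-style split boundaries: divide
// [min, max] of the split column into numSplits half-open ranges.
public class DataDrivenSplits {
    static List<long[]> splitRange(long min, long max, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long span = max - min + 1;
        long chunk = span / numSplits;
        long remainder = span % numSplits;
        long lo = min;
        for (int i = 0; i < numSplits; i++) {
            // Spread the remainder over the first few splits
            long size = chunk + (i < remainder ? 1 : 0);
            splits.add(new long[]{lo, lo + size}); // [lo, hi) half-open
            lo += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : splitRange(1, 100, 4))
            System.out.println(s[0] + " .. " + s[1]);
    }
}
```

The real input formats also handle text, date, and floating-point split columns; the integer case above is just the simplest illustration.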
When using the JDBCTap with an Oracle database (using Oracle's ojdbc6.jar driver) the flow fails with an IOException:
Caused by: java.io.IOException: unable to execute insert batch [msglength: 29][totstmts: 1000][crntstmts: 1000][batch: 1000] ORA-00911: invalid character
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.createThrowMessage(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.executeBatch(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.JDBCTapCollector.collect(Unknown Source)
at com.twitter.maple.jdbc.JDBCScheme.sink(Unknown Source)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
This is fixed by removing the query.append(";"); on line 276 of DBOutputFormat.java and removing the semicolon from the query.append(");") on line 231. Apparently the Oracle JDBC driver doesn't accept a trailing semicolon at the end of a SQL statement.
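A driver-agnostic way to express the same fix is to strip any trailing semicolon before handing the statement to the driver; `stripTrailingSemicolon` is a hypothetical helper, not code from DBOutputFormat:

```java
// Hypothetical guard: remove a trailing semicolon so the same generated
// INSERT works on Oracle (which rejects it with ORA-00911) and on
// drivers that happen to tolerate it.
public class SqlTerminator {
    static String stripTrailingSemicolon(String sql) {
        String trimmed = sql.trim();
        return trimmed.endsWith(";")
                ? trimmed.substring(0, trimmed.length() - 1)
                : trimmed;
    }

    public static void main(String[] args) {
        // prints: INSERT INTO t (a) VALUES (?)
        System.out.println(stripTrailingSemicolon("INSERT INTO t (a) VALUES (?);"));
    }
}
```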
When there is a configuration issue or ZooKeeper isn't running, the error reported is that the table does not exist. In the case below the table does exist; the tap just doesn't know that because it hasn't connected to ZooKeeper.
2012-07-25 17:40:20,588 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: TABLENAME does not exist !
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.initialize(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.prepare(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at cascading.flow.stream.SinkStage.prepare(SinkStage.java:60)
at cascading.flow.stream.StreamGraph.prepare(StreamGraph.java:165)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:107)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Here's a stack trace.
cascading.tuple.TupleException: unable to read from input identifier: 'unknown'
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.io.ImmutableBytesWritable.(ImmutableBytesWritable.java:60)
at com.twitter.maple.hbase.HBaseScheme.source(Unknown Source)
at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
... 6 more
Suppose my table in HBase has the following schema:
table_name: test
column_family: cf
Say my HBaseScheme expects the value fields foo and bar, and the test table has the following rows:
1, cf:foo="hello", cf:bar="world"
2, cf:foo="bye"
Row 2 triggers the exception described above.
I'd expect an empty byte array to be returned for row 2's cf:bar column.
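The fix could be as simple as substituting an empty byte array whenever HBase returns null for a missing cell, since ImmutableBytesWritable rejects a null buffer; `orEmpty` is a hypothetical helper sketching the idea, not maple's actual code:

```java
// Hypothetical sketch: map a missing cell (null from HBase) to an empty
// byte array before wrapping it, so sparse rows don't NPE in source().
public class MissingCellFix {
    static final byte[] EMPTY = new byte[0];

    static byte[] orEmpty(byte[] cellValue) {
        return cellValue != null ? cellValue : EMPTY;
    }

    public static void main(String[] args) {
        // Row 2 has no cf:bar cell, so HBase would return null for it
        byte[] bar = orEmpty(null);
        System.out.println(bar.length); // prints 0
    }
}
```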
The identifier should be a combination of all unique attributes so that Cascading can optimize resources. (Currently we just append a UUID to the end of the connection URL.)
I have an HBase scheme/tap written for a Cascading 2.0 pre-release version that I would love to replace with the HBase tap/scheme in maple. One issue I'm running into is that the code assumes row keys and values are strings; I'm using bytes as the key and Thrift structures serialized to bytes as the values.
Is there any interest in making maple's HBaseScheme more flexible in this regard? It looks like the scheme's source code just puts the bytes into a tuple. Maybe the sink code could do the same?
Does this support HBase filters (http://hbase.apache.org/book/client.filter.html) such as SingleColumnValueFilter?
Hi,
I recently ran into an issue when trying to write byte arrays using the JDBCTap (I'm using the bytea type in PostgreSQL). The issue is almost identical to the one resolved by this pull request, but concerns writing rather than reading objects from Postgres. It boils down to the fact that cascading.tuple.Tuple's get method casts to Comparable, which of course breaks types that don't implement that interface.
I have a patch for this issue which solves it for me without breaking any of the existing code.
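For anyone reproducing this, the root cause can be demonstrated without Postgres at all, since byte[] simply doesn't implement Comparable; `getAsComparable` below is a hypothetical stand-in for the cast that Tuple's get performs:

```java
// Minimal reproduction of the failure mode: casting a byte[] to
// Comparable throws ClassCastException at runtime, because arrays
// don't implement that interface.
public class ComparableCastDemo {
    static Comparable getAsComparable(Object value) {
        return (Comparable) value; // stand-in for Tuple.get's cast
    }

    public static void main(String[] args) {
        try {
            getAsComparable(new byte[]{0x01, 0x02}); // a bytea payload
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```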
Cheers.