cascading / maple

All the Cascading taps you need and love.
Hi everyone,
Could you kindly provide some detail on how to use the HBase support in maple for testing jobs with the JobTest class included in Twitter's scalding?
Thank you in advance.
When using maple to import a 40GB+ Postgres database, I noticed that queries became progressively slower and the complete Hadoop job failed because of the use of OFFSET.
After changing the generated query to the following:
// HARDCODING PRIMARY KEY.....
query.append(" WHERE id >= ").append(split.getStart());
query.append(" LIMIT ").append(split.getLength());
With this change the query time no longer grows with the offset and stays roughly constant. The above is not a generic solution (e.g. your split column might not be id). Do you have suggestions for handling this situation? I'm also not sure how other JDBC databases handle OFFSET.
Has this library been used on large Postgres databases before? I would like some insight into best practices; even with the above optimization my import takes around 3 hours.
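As a sketch of how the hardcoded fix above could be generalized, here is a keyset-pagination query builder. `buildSplitQuery` and the `splitColumn` parameter are hypothetical names, not maple's actual API, and the `ORDER BY` is added here on the assumption that each split should read a deterministic range:

```java
// Hypothetical sketch: build a per-split query using keyset pagination
// (seek to the split's first key) instead of OFFSET (scan past N rows).
public class KeysetQueryBuilder {
    static String buildSplitQuery(String baseQuery, String splitColumn,
                                  long start, long length) {
        StringBuilder query = new StringBuilder(baseQuery);
        // Jump straight to the split's first key via the index
        query.append(" WHERE ").append(splitColumn).append(" >= ").append(start);
        // Deterministic ordering so LIMIT carves out a stable range
        query.append(" ORDER BY ").append(splitColumn);
        query.append(" LIMIT ").append(length);
        return query.toString();
    }

    public static void main(String[] args) {
        // prints: SELECT id, name FROM users WHERE id >= 1000 ORDER BY id LIMIT 500
        System.out.println(
            buildSplitQuery("SELECT id, name FROM users", "id", 1000, 500));
    }
}
```

This only works when the split column is indexed and roughly monotonic, which is exactly the limitation the original report raises.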
Thanks for you work on maple.
Cheers,
Jeroen
Use org.apache.hadoop.mapreduce instead of the deprecated org.apache.hadoop.mapred API.
This would support splits for databases that don't allow LIMIT and OFFSET.
Similar implementations exist in Sqoop (https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/db/DataDrivenDBInputFormat.java) and Hadoop (https://github.com/apache/hadoop-mapreduce/blob/trunk/src/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java). However, due to implementation differences, some translation would be needed to port them.
I can do this work, but I wanted to see what others think. Do you have any recommendations?
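The core of the data-driven approach is just range arithmetic: take the MIN and MAX of an indexed column and divide that interval into contiguous ranges, so each split becomes a `WHERE col >= lo AND col < hi` query with no LIMIT/OFFSET. A minimal sketch of that boundary math (`splitRange` is a hypothetical helper, not Sqoop's or Hadoop's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of DataDrivenDBInputFormat-style split boundaries: divide
// [min, max] of the split column into numSplits half-open ranges.
public class DataDrivenSplits {
    static List<long[]> splitRange(long min, long max, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long span = max - min + 1;
        long chunk = span / numSplits;
        long remainder = span % numSplits;
        long lo = min;
        for (int i = 0; i < numSplits; i++) {
            // Spread the remainder over the first few splits
            long size = chunk + (i < remainder ? 1 : 0);
            splits.add(new long[]{lo, lo + size}); // [lo, hi) half-open
            lo += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : splitRange(1, 100, 4))
            System.out.println(s[0] + " .. " + s[1]);
    }
}
```

The real input formats also handle text, date, and floating-point split columns; the integer case above is just the simplest illustration.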
When using the JDBCTap with an Oracle database (using Oracle's ojdbc6.jar driver) the flow fails with an IOException:
Caused by: java.io.IOException: unable to execute insert batch [msglength: 29][totstmts: 1000][crntstmts: 1000][batch: 1000] ORA-00911: invalid character
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.createThrowMessage(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.executeBatch(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.db.DBOutputFormat$DBRecordWriter.write(Unknown Source)
at com.twitter.maple.jdbc.JDBCTapCollector.collect(Unknown Source)
at com.twitter.maple.jdbc.JDBCScheme.sink(Unknown Source)
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:153)
This is fixed by removing the query.append(";"); on line 276 of DBOutputFormat.java and removing the semicolon from the query.append(");") on line 231. Apparently the Oracle JDBC driver doesn't accept a trailing semicolon at the end of a SQL statement.
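A driver-agnostic way to express the same fix is to strip any trailing semicolon before handing the statement to the driver; `stripTrailingSemicolon` is a hypothetical helper, not code from DBOutputFormat:

```java
// Hypothetical guard: remove a trailing semicolon so the same generated
// INSERT works on Oracle (which rejects it with ORA-00911) and on
// drivers that happen to tolerate it.
public class SqlTerminator {
    static String stripTrailingSemicolon(String sql) {
        String trimmed = sql.trim();
        return trimmed.endsWith(";")
                ? trimmed.substring(0, trimmed.length() - 1)
                : trimmed;
    }

    public static void main(String[] args) {
        // prints: INSERT INTO t (a) VALUES (?)
        System.out.println(stripTrailingSemicolon("INSERT INTO t (a) VALUES (?);"));
    }
}
```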
When there is a configuration issue or ZooKeeper isn't running, the error reported is that the table does not exist. In the case below the table does exist; the tap just doesn't know that because it hasn't connected to ZooKeeper.
2012-07-25 17:40:20,588 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: TABLENAME does not exist !
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.sinkConfInit(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.initialize(Unknown Source)
at com.twitter.maple.hbase.HBaseTapCollector.prepare(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at com.twitter.maple.hbase.HBaseTap.openForWrite(Unknown Source)
at cascading.flow.stream.SinkStage.prepare(SinkStage.java:60)
at cascading.flow.stream.StreamGraph.prepare(StreamGraph.java:165)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:107)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Here's a stack trace.
cascading.tuple.TupleException: unable to read from input identifier: 'unknown'
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:127)
at cascading.flow.stream.SourceStage.map(SourceStage.java:76)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:124)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.io.ImmutableBytesWritable.(ImmutableBytesWritable.java:60)
at com.twitter.maple.hbase.HBaseScheme.source(Unknown Source)
at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
... 6 more
Suppose my table in HBase has the following schema:
table_name: test
column_family: cf
Say my HBaseScheme expects the value fields foo and bar, and the test table has the following rows:
1, cf:foo="hello", cf:bar="world"
2, cf:foo="bye"
Row 2 triggers the exception described above.
I'd expect an empty byte array to be returned for row 2's cf:bar column.
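The fix could be as simple as substituting an empty byte array whenever HBase returns null for a missing cell, since ImmutableBytesWritable rejects a null buffer; `orEmpty` is a hypothetical helper sketching the idea, not maple's actual code:

```java
// Hypothetical sketch: map a missing cell (null from HBase) to an empty
// byte array before wrapping it, so sparse rows don't NPE in source().
public class MissingCellFix {
    static final byte[] EMPTY = new byte[0];

    static byte[] orEmpty(byte[] cellValue) {
        return cellValue != null ? cellValue : EMPTY;
    }

    public static void main(String[] args) {
        // Row 2 has no cf:bar cell, so HBase would return null for it
        byte[] bar = orEmpty(null);
        System.out.println(bar.length); // prints 0
    }
}
```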
The identifier should be a combination of all unique attributes so that Cascading can optimize resources. (Currently we just append a UUID to the end of the connection URL.)
I have an HBase scheme/tap written for a Cascading 2.0 pre-release version that I would love to replace with the HBase tap/scheme in maple. One issue I'm running into is that the code assumes row keys and values are strings; I'm using bytes as the key and Thrift structures serialized to bytes as the values.
Is there any interest in making maple's HBaseScheme more flexible in this regard? It looks like the scheme's source code just puts the bytes into a tuple. Maybe the sink code could do the same?
Does this support HBase filters (http://hbase.apache.org/book/client.filter.html) such as SingleColumnValueFilter?
Hi,
I recently ran into an issue when trying to write byte arrays using the JDBCTap (I'm using the bytea type in PostgreSQL). The issue is almost identical to the one resolved by this pull request, but concerns writing rather than reading objects from Postgres. It boils down to the fact that cascading.tuple.Tuple's get method casts to Comparable, which of course breaks types that don't implement that interface.
I have a patch for this issue which solves it for me without breaking any of the existing code.
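For anyone reproducing this, the root cause can be demonstrated without Postgres at all, since byte[] simply doesn't implement Comparable; `getAsComparable` below is a hypothetical stand-in for the cast that Tuple's get performs:

```java
// Minimal reproduction of the failure mode: casting a byte[] to
// Comparable throws ClassCastException at runtime, because arrays
// don't implement that interface.
public class ComparableCastDemo {
    static Comparable getAsComparable(Object value) {
        return (Comparable) value; // stand-in for Tuple.get's cast
    }

    public static void main(String[] args) {
        try {
            getAsComparable(new byte[]{0x01, 0x02}); // a bytea payload
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: " + e.getMessage());
        }
    }
}
```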
Cheers.