Comments (3)
I think this would likely improve performance significantly. Currently I see a lot of CPU time being spent in threads whose stack traces look like this:
"MapPartition (map-DB4889E9084A472E82A7F42B3482D1F6) (3/20)" daemon prio=10 tid=0x00007f556747d800 nid=0x6690 runnable [0x00007f5569647000]
java.lang.Thread.State: RUNNABLE
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:173)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:166)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:599)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:199)
at com.dataartisans.flink.cascading.types.tuple.DefinedTupleSerializer.serialize(DefinedTupleSerializer.java:142)
at com.dataartisans.flink.cascading.types.tuple.DefinedTupleSerializer.serialize(DefinedTupleSerializer.java:29)
at org.apache.flink.runtime.plugable.SerializationDelegate.write(SerializationDelegate.java:56)
at org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:79)
at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:84)
- locked <0x00000006e78d9510> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
at com.dataartisans.flink.cascading.runtime.boundaryStages.BoundaryOutStage.receive(BoundaryOutStage.java:44)
at com.dataartisans.flink.cascading.runtime.boundaryStages.BoundaryOutStage.receive(BoundaryOutStage.java:29)
at cascading.flow.stream.element.FunctionEachStage$1.collect(FunctionEachStage.java:81)
from cascading-flink.
Would it be OK to have a dependency on Hadoop for this? I assume yes, in which case it seems possible to wire up Cascading's TupleSerialization (or a version thereof) instead of using Kryo.
Flink itself depends on Hadoop, so there should already be a transitive dependency on Hadoop.
If it is an option for you, you can also add types to the Fields declaration. In that case, the connector will use Flink's "native" serializers and comparators, which support comparisons directly on the binary representation (that is, without deserializing the records first).
In any case, it would be good to improve the serialization for untyped records.
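To make the point about comparing on the binary representation concrete, here is a minimal, self-contained sketch in plain Java. It is not Flink's or the connector's actual code; the class `BinaryComparisonSketch` and its helper methods are hypothetical names chosen for illustration. It assumes a fixed-width, big-endian encoding with the sign bit flipped, so that an unsigned byte-wise comparison of the serialized form gives the same order as comparing the original `int` values.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustration only: a hypothetical order-preserving encoding for int keys.
public class BinaryComparisonSketch {

    // Encode an int so that unsigned, byte-wise order matches numeric order.
    // Flipping the sign bit maps MIN_VALUE..MAX_VALUE onto 0x00000000..0xFFFFFFFF;
    // ByteBuffer writes big-endian by default, so the most significant byte comes first.
    static byte[] toSortableBytes(int value) {
        return ByteBuffer.allocate(Integer.BYTES)
                .putInt(value ^ Integer.MIN_VALUE)
                .array();
    }

    // Reverse of toSortableBytes, used here only to print the result.
    static int fromSortableBytes(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
    }

    // Compare two serialized keys without deserializing them.
    static int compareSerialized(byte[] left, byte[] right) {
        int n = Math.min(left.length, right.length);
        for (int i = 0; i < n; i++) {
            // Mask to treat each byte as unsigned.
            int cmp = Integer.compare(left[i] & 0xFF, right[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(left.length, right.length);
    }

    public static void main(String[] args) {
        int[] values = {42, -7, 0, Integer.MIN_VALUE, Integer.MAX_VALUE};
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = toSortableBytes(values[i]);
        }
        // Sort purely on the serialized bytes.
        Arrays.sort(encoded, BinaryComparisonSketch::compareSerialized);
        for (byte[] key : encoded) {
            System.out.println(fromSortableBytes(key));
        }
    }
}
```

With untyped fields serialized through a generic serializer such as Kryo, no such ordering guarantee on the bytes can be assumed, so records generally have to be deserialized before they can be compared.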