coolplayspark's People

Contributors

endymecy, hangim, jacksu, lw-lin, mtunique, ouyangshourui, wangmiao1981, wongxingjun, xiaoguoqiang, zilutang

coolplayspark's Issues

[Question] How does Spark maintain Kafka offsets when using watermarks?

I have read many articles, for example the older approach of reading the offsetRanges attribute directly from the RDD. But with watermarking we usually hold an already-transformed Dataset, so in that case how do we guarantee, or rather maintain ourselves, the latest offsets of the data that Spark Structured Streaming has finished processing?
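
For reference, a minimal sketch of one way to track this externally (where the offsets get persisted is up to you; the app name is a placeholder): a StreamingQueryListener observes, after every micro-batch, the end offsets each source reached, regardless of how many transformations the Dataset has gone through.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("offset-tracking").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // For a Kafka source, endOffset is a JSON string like {"topic1":{"0":1234}}.
    event.progress.sources.foreach { s =>
      println(s"source=${s.description} endOffset=${s.endOffset}")
      // Persist this JSON to an external store if offsets must be tracked manually.
    }
  }
})

Note that Structured Streaming also records these offsets itself in the checkpoint directory's offsets log, so manual tracking like the above is only needed when an external system has to see them.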

A question about the explanation in the "pluggable ReceiverSchedulingPolicy" section

In "3.1 Receiver 分发详解", section (1), "the pluggable ReceiverSchedulingPolicy":
It says that when Receiver y fails, earlier versions of Spark Streaming might restart Receiver y on executor 1, whereas since 1.5.0 it will be restarted on executor 2.
Shouldn't that be executor 3?

Efficiency of reading data from multiple topics

When Structured Streaming reads multiple Kafka topics (each topic being a different data source), which is more efficient: subscribing directly with subscribe=topic1,topic2,topic3, or obtaining a separate Dataset[KafkaData] for each topic and union-ing them before processing?
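
For concreteness, a minimal sketch of the two approaches being compared (broker address and topic names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-topic").getOrCreate()

// Approach 1: a single source subscribed to all topics at once.
val all = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic1,topic2,topic3")
  .load()

// Approach 2: one source per topic, unioned afterwards.
def oneTopic(topic: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", topic)
  .load()

val unioned = oneTopic("topic1").union(oneTopic("topic2")).union(oneTopic("topic3"))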

The Streaming job fails after the executors have run for a while

Hi @lw-lin,
When we use Spark Streaming, we find that after the executors run for a while (about an hour) the whole job fails, even though CPU, memory, network, and GC all look healthy.

error:
java.lang.Exception: Could not compute split, block input-0-1416573258200 not found

The storage level was initially MEMORY_ONLY; when data volume spiked we hit this error, so we changed the storage level to MEMORY_AND_DISK, but the job still hits the same error after running for a while. At the same time, a communication timeout (120s) between the Executor and the ReceiverTracker is thrown.

Is there a good way to troubleshoot this? Thanks.

PS: the deploy mode is yarn-cluster.
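
For reference, not a confirmed fix: "Could not compute split, block ... not found" on receiver-based input usually means an input block was evicted or lost before the job that needed it ran. A hedged sketch of two commonly suggested mitigations, a serialized disk-backed storage level for the receiver plus the write-ahead log (host, port, and paths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiver-wal")
  // With the WAL on, received blocks are also written to the checkpoint dir.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/receiver-wal") // required for the WAL

// Serialized + disk-backed blocks are far less likely to be evicted.
val lines = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)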

About Spark Streaming's join operation

I saw the introduction to join in the official Spark Streaming docs:

Here, in each batch interval, the RDD generated by stream1 will be joined with the RDD generated by stream2. You can also do leftOuterJoin, rightOuterJoin, fullOuterJoin. Furthermore, it is often very useful to do joins over windows of the streams. That is pretty easy as well.

Is the implementation detail that this join only covers the streams' RDDs within a single batch, i.e., cross-batch joins are not yet possible?
If Spark Streaming cannot yet do cross-batch joins, what would the general approach be if we implemented one ourselves?
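
For reference, a minimal sketch of both joins the quoted docs describe (stream1 and stream2 are assumed to be key-value DStreams, as in the docs):

import org.apache.spark.streaming.{Minutes, Seconds}

// Per batch: joins only the two RDDs generated for the same batch interval.
val joined = stream1.join(stream2)

// Windowed: each side carries the last 10 minutes of batches, sliding every
// 30 seconds, so records from different batches can meet inside one window.
val windowedJoin = stream1.window(Minutes(10), Seconds(30))
  .join(stream2.window(Minutes(10), Seconds(30)))

Widening the windows like this is the usual first approximation of a cross-batch join; beyond that, the common do-it-yourself approaches keep one side in state (updateStateByKey / mapWithState) or in an external store and join against it batch by batch.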

Reading Redis from Spark Streaming

I wrote a custom RedisZsetReceiver that reads everything in a given zset with piped.zrange(key, 0, -1); the zset only ever grows, it never shrinks.
After I submit the job and it runs on schedule, the per-batch record count on the Spark Streaming UI differs from batch to batch instead of increasing monotonically. How can I get the behavior I want? The goal is for every batch to read out the entire contents of the Redis zset. Is Spark Streaming a good fit for this scenario?
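
The issue does not show the actual RedisZsetReceiver, so the following is only a hedged sketch of a receiver along the lines described (the Jedis client, key, and poll interval are assumptions):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

class RedisZsetReceiver(host: String, port: Int, key: String, pollMs: Long)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    new Thread("redis-zset-poller") {
      override def run(): Unit = {
        val jedis = new Jedis(host, port)
        try {
          while (!isStopped()) {
            // Full snapshot of the zset on every poll: zrange(key, 0, -1).
            jedis.zrange(key, 0, -1).asScala.foreach(store)
            Thread.sleep(pollMs)
          }
        } finally jedis.close()
      }
    }.start()
  }

  override def onStop(): Unit = () // the poller thread exits via isStopped()
}

With a design like this, how many full snapshots land in a given batch depends on how the poll interval lines up with the batch interval, which would explain per-batch record counts that vary instead of growing monotonically.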

How is exactly-once semantics ensured during driver failure recovery?

Hi, I have a question.
Suppose some of a jobSet's jobs have already completed successfully when the driver exits abnormally.
After recovery, will the already-successful jobs of that jobSet be executed again?
If not, where is this reflected in the source code?
If they are re-executed, then exactly-once semantics is clearly violated, so how should we understand the exactly-once semantics usually claimed for Streaming?
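
For context, a minimal sketch of the recovery path in question (paths and the graph-building function are placeholders): with checkpointing, the driver comes back via StreamingContext.getOrCreate, and batches that had not fully completed are re-submitted from the checkpoint, which is why output operations must be idempotent or transactional for true end-to-end exactly-once.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/app"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("app"), Seconds(5))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream graph and output operations here ...
  ssc
}

// Fresh start: calls createContext(). After a driver crash: rebuilds the
// context from the checkpoint and re-submits the pending batches.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()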

structured streaming java.io.EOFException

A Structured Streaming program hits the following error after running for a while. What could be causing this exception?

User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 123 in stage 1.0 failed 4 times, most recent failure: Lost task 123.3 in stage 1.0 (TID 175, ddn012075.heracles.sohuno.com, executor 1): java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The relationship among batch duration, window duration, and sliding duration

Could you take some time to explain the relationship among these three? The docs only seem to say that window duration and sliding duration should both be multiples of the batch duration; is job submission driven by the batch duration or by the sliding duration? Please clarify this for me.
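
For reference, a minimal sketch of the constraint in question (the durations are examples): window duration and slide duration must each be a multiple of the batch duration, and a windowed stream produces output once per slide duration rather than once per batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch duration: 5s. The generator ticks every batch, but...
val ssc = new StreamingContext(new SparkConf().setAppName("windows"), Seconds(5))
val pairs = ssc.socketTextStream("host", 9999).map(word => (word, 1))

// ...this stream only emits a result every slide (10s), each time
// aggregating the last window (30s) of batches. Both 30s and 10s
// must be multiples of the 5s batch duration.
val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b,
  Seconds(30), Seconds(10))
counts.print()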

Asking for advice

What is a good load-testing approach for a newly deployed Spark Streaming setup? Thanks.
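
Not a full answer, but one hedged starting point when Structured Streaming is an option: the built-in rate source emits rows at a configurable rate, so it can stand in for the real input while you raise rowsPerSecond and watch processing time against the trigger interval in the UI (the app name and console sink are placeholders for the pipeline under test):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-test").getOrCreate()

// Synthetic input: (timestamp, value) rows at a controlled rate.
val load = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100000") // raise gradually while watching the UI
  .load()

// Swap the console sink for the actual pipeline and sink being tested.
load.writeStream.format("console").option("truncate", "true").start().awaitTermination()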

The StateStore implementation and exactly-once

According to the write-up, the default implementation stores state on HDFS; if a given version of a given partition of an operator fails, the archived snapshot for that partition is re-read and rewritten. But on the sink end, without idempotence or transactions, a partition whose write fails partway through will presumably be retried as a whole, so the part that was already written would show up twice. How, then, does the end-to-end exactly-once described in the article come about?
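
For reference, a hedged sketch of the usual answer: the end-to-end claim holds only when the sink side is idempotent or transactional. Since batchId is deterministic across retries of the same micro-batch, one pattern is to overwrite the batch's own output partition, so a partially written then retried partition lands in the same place instead of duplicating (paths are placeholders, and the rate source stands in for the real input):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("e2e")
  // Overwrite only the partitions present in each write, not the whole table.
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

val streamingDF = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

val query = streamingDF.writeStream
  .option("checkpointLocation", "/checkpoints/e2e")
  .foreachBatch((batch: DataFrame, batchId: Long) => {
    // batchId is stable across retries, so rewriting this batch's own
    // partition makes the retry overwrite the earlier partial write.
    batch.withColumn("batch_id", lit(batchId))
      .write
      .mode("overwrite")
      .partitionBy("batch_id")
      .parquet("/sinks/output")
  })
  .start()
query.awaitTermination()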

The document "0.1 Spark Streaming 实现思路与模块概述.md" contains an incorrect description

As follows:

The relationship between DStream and RDD
"Given that DStream is a template for RDDs, and that DStream and RDD share the same transformation operations, such as map(), filter(), reduce(), and so on (it is precisely these shared transformations that let DStreamGraph faithfully record the computation logic of the RDD DAG), is there any difference between an RDD and a DStream?"

The description here is wrong: reduce() is an action, not a transformation.
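
For reference, a minimal illustration of the distinction behind this report: on an RDD, reduce() is indeed an action that eagerly returns a value, whereas the same-named method on a DStream is a transformation returning a new DStream.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def rddSide(rdd: RDD[Int]): Int =
  rdd.reduce(_ + _) // action: triggers a job and returns a plain Int

def dstreamSide(stream: DStream[Int]): DStream[Int] =
  stream.reduce(_ + _) // transformation: lazily yields a one-element-per-batch DStream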
