high-performance-spark / high-performance-spark-examples
Examples for High Performance Spark
License: Other
[error] Server access Error: Connection refused (Connection refused) url=http://repo.typesafe.com/typesafe/releases/org/apache/hadoop/hadoop-yarn/2.6.5/hadoop-yarn-2.6.5.jar
I am using a home network; no proxy is needed.
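If the failure comes from the stale repo.typesafe.com resolver rather than the network, one possible workaround (an assumption, not a verified fix) is to make the build prefer Maven Central for these artifacts:

// build.sbt: put Maven Central ahead of any other resolvers declared in this
// build, so the unreachable Typesafe repository is not consulted first.
resolvers := Seq("Maven Central" at "https://repo1.maven.org/maven2/") ++ resolvers.value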
Add a mapPartitions example to GenerateScalingData
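A rough sketch of the kind of example this could be (the function and types here are hypothetical, not GenerateScalingData's actual API): mapPartitions lets the expensive setup happen once per partition instead of once per record.

import org.apache.spark.rdd.RDD
import scala.util.Random

// Hypothetical helper: generate one double per input record, reusing a single
// Random instance per partition.
def generateDoubles(input: RDD[Long]): RDD[Double] = {
  input.mapPartitions { iter =>
    val rand = new Random()          // expensive setup, done once per partition
    iter.map(_ => rand.nextDouble()) // reused for every element in the partition
  }
}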
Our test coverage is super low; try to improve it.
We should port at least some of the examples to the Java API.
Add a simple benchmark to illustrate some of the differences between Spark Core and DataFrames/Datasets.
One benchmark for the sum of fuzzyness, the other for the max panda id.
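A rough sketch of the benchmark idea (the Panda case class is a stand-in, not the book's actual schema): run the same two aggregations through the RDD API and through Datasets, then wrap each side in a timer for the actual comparison.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum}

case class Panda(id: Long, fuzzyness: Double)

def compareApis(spark: SparkSession, pandas: Seq[Panda]): Unit = {
  import spark.implicits._
  val rdd = spark.sparkContext.parallelize(pandas)
  val ds = pandas.toDS()

  // Spark Core: sum of fuzzyness and max panda id.
  val coreSum = rdd.map(_.fuzzyness).sum()
  val coreMax = rdd.map(_.id).max()

  // Dataset/DataFrame: the same aggregations, expressed declaratively so the
  // optimizer can plan them.
  val row = ds.agg(sum("fuzzyness"), max("id")).head()
  println(s"core: $coreSum / $coreMax, dataset: ${row.get(0)} / ${row.get(1)}")
}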
It could be my own unfamiliarity with Scala, but I'm currently unable to build this project as-is using sbt. I'm getting the stack trace pasted below from IntelliJ (I get something similar when running 'sbt compile' from the CLI).
SBT 'high-performance-spark-examples' project refresh failed
Error: Error while importing SBT project:
...
[info] Resolving org.scala-sbt#apply-macro;0.13.13 ...
[info] Resolving org.spire-math#json4s-support_2.10;0.6.0 ...
[info] Resolving org.codehaus.plexus#plexus-component-annotations;1.5.5 ...
[info] Resolving javax.annotation#jsr250-api;1.0 ...
[info] Resolving com.thoughtworks.paranamer#paranamer;2.6 ...
[info] Resolving com.typesafe#config;1.2.0 ...
[info] Resolving org.scala-sbt#test-agent;0.13.13 ...
[info] Resolving org.scala-sbt#classfile;0.13.13 ...
[info] Resolving org.scala-sbt#completion;0.13.13 ...
[info] Resolving org.scala-sbt#test-interface;1.0 ...
[info] Resolving com.jcraft#jsch;0.1.50 ...
[info] Resolving org.scala-lang#scala-compiler;2.10.6 ...
[info] Resolving org.scala-sbt#interface;0.13.13 ...
[info] Resolving javax.inject#javax.inject;1 ...
[info] Resolving org.scala-sbt#logging;0.13.13 ...
[trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the full output.
[trace] Stack trace suppressed: run 'last *:update' for the full output.
[error] (*:ssExtractDependencies) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] (*:update) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] Total time: 19 s, completed Jul 10, 2017 10:47:35 AM
See complete log in file:/Users/adam/Library/Logs/IdeaIC2017.1/sbt.last.log
val r = SecondarySort.groupByKeyAndSortBySecondaryKey(data, 3)
val rSorted = r.collect().sortWith(lt = (a, b) => a._1.toDouble > b._1.toDouble)
assert(r.collect().zipWithIndex.forall {
  case ((key, list), index) => rSorted(index)._1.equals(key)
})
Actually, r is not ordered, so it isn't correct to compare r element-by-element with rSorted.
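Since the global order of r across partitions isn't guaranteed, one possible replacement check (a sketch; it assumes the intended guarantee is that keys are sorted within each partition, which depends on how groupByKeyAndSortBySecondaryKey is implemented) is:

// Check the key ordering per partition instead of across the whole RDD.
val partitionsOrdered = r.mapPartitions { iter =>
  val keys = iter.map(_._1.toDouble).toList
  Iterator(keys == keys.sorted) // use keys.sorted.reverse for descending order
}.collect()
assert(partitionsOrdered.forall(identity))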
Example 3-4
It's difficult to work in IntelliJ with sbt 0.13 (it takes a long time to "dump project structure from sbt"), and plugins like sbt-spark-package don't seem to work with sbt 1.x.
Where does the head :: rest pattern (matched against the list value) come from in the following function?
def groupSorted[K, S, V](it: Iterator[((K, S), V)]): Iterator[(K, List[(S, V)])] = {
  val res = List[(K, ArrayBuffer[(S, V)])]()
  it.foldLeft(res)(
    (list, next) => list match {
      case Nil =>
        val ((firstKey, secondKey), value) = next
        List((firstKey, ArrayBuffer((secondKey, value))))
      case head :: rest =>
        val (curKey, valueBuf) = head
        val ((firstKey, secondKey), value) = next
        if (!firstKey.equals(curKey)) {
          (firstKey, ArrayBuffer((secondKey, value))) :: list
        } else {
          valueBuf.append((secondKey, value))
          list
        }
    }
  ).map { case (key, buf) => (key, buf.toList) }.iterator
}
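For what it's worth, head :: rest is just a list pattern on the accumulator: foldLeft starts with the empty res, so the first element hits the Nil case, and every later element matches head :: rest, where head is the most recently started group. A tiny, hypothetical trace (assuming the input iterator is already sorted by the primary key, which the function relies on):

import scala.collection.mutable.ArrayBuffer

// Primary keys "a" then "b", already sorted.
val input = Iterator((("a", 1), "x"), (("a", 2), "y"), (("b", 1), "z"))
val grouped = groupSorted(input).toList
// New groups are prepended, so keys come out in reverse order of first
// appearance: List((b, List((1, z))), (a, List((1, x), (2, y))))
println(grouped)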
It would be great if you could show how push-down filters behave (if they work at all) when a filter is applied on the rowKey of an HBase-backed Hive table.
There seems to be a problem with https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/ml/CustomPipeline.scala#L125. For example, if we try testing it thus
val indexer = new SimpleIndexer()
indexer.setInputCol("inputColumn")
indexer.setOutputCol("categoryIndex")
val model = indexer.fit(ds)
val predicted = model.transform(ds)
the indexer has the inputCol set, but the model (that is, the object returned by indexer.fit) does not. So when we try model.transform, it complains that it "Failed to find a default value for inputCol".
Following https://stackoverflow.com/questions/40847625/spark-custom-estimator-access-to-paramt, @conorbmurphy and I were able to "solve" the problem by hard-coding the column names within the class with
setDefault(inputCol, "inputColumn")
setDefault(outputCol, "categoryIndex")
but that's hardly the right solution. What should happen is that the paramMap should be copied into the SimpleIndexerModel when it's generated in the fit method, but we can't figure out how to do that since paramMap is protected.
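For reference, the pattern Spark's own estimators use is to call copyValues on the model inside fit; copyValues is inherited from Params, so it is usable there even though paramMap itself is protected. A rough sketch (the SimpleIndexerModel constructor arguments are placeholders, not necessarily the class's real signature):

override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
  // ... existing fitting logic that produces the fitted labels ...
  val model = new SimpleIndexerModel(uid, labels)
  // Copy this estimator's param map (inputCol, outputCol, ...) onto the model
  // so that model.transform can resolve them.
  copyValues(model.setParent(this))
}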
Add dataset join examples.
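A minimal sketch of what one such example might cover (the case classes are placeholders, not the book's schema): joinWith keeps both sides typed instead of flattening the result into a DataFrame.

import org.apache.spark.sql.SparkSession

case class PandaInfo(id: Long, zip: String)
case class PandaScore(id: Long, score: Double)

def joinExample(spark: SparkSession): Unit = {
  import spark.implicits._
  val info = Seq(PandaInfo(1L, "94110"), PandaInfo(2L, "10001")).toDS()
  val scores = Seq(PandaScore(1L, 0.9)).toDS()
  // Typed join: the result is a Dataset[(PandaInfo, PandaScore)].
  val joined = info.joinWith(scores, info("id") === scores("id"), "inner")
  joined.show()
}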
object CoPartitioningLessons corresponds to the book samples Example 6-8 and Example 6-9. The two functions use two different Partitioners to show co-location and co-partitioning.
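The core idea looks roughly like this (a sketch, not the repo's CoPartitioningLessons code): when both RDDs share the same Partitioner, the subsequent join is a narrow dependency and neither side is reshuffled.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def coPartitionedJoin(left: RDD[(Long, String)], right: RDD[(Long, Double)]) = {
  val partitioner = new HashPartitioner(8)
  // Both sides use the same partitioner, so matching keys land in matching partitions.
  val leftPart = left.partitionBy(partitioner)
  val rightPart = right.partitionBy(partitioner)
  // With matching partitioners, this join does not shuffle either input.
  leftPart.join(rightPart)
}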
We should port some of the examples to Python
Once travis-ci/apt-source-safelist#287 is resolved (or we can otherwise access R >= 3 in our Travis env), enable the SparkR tests.