high-performance-spark / high-performance-spark-examples
Examples for High Performance Spark
License: Other
[error] Server access Error: Connection refused (Connection refused) url=http://repo.typesafe.com/typesafe/releases/org/apache/hadoop/hadoop-yarn/2.6.5/hadoop-yarn-2.6.5.jar
I am using a home network; no proxy is needed.
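If the failure comes from the stale repo.typesafe.com resolver rather than the network, one possible workaround (an assumption, not a verified fix) is to make the build prefer Maven Central for these artifacts:

// build.sbt: put Maven Central ahead of any other resolvers declared in this
// build, so the unreachable Typesafe repository is not consulted first.
resolvers := Seq("Maven Central" at "https://repo1.maven.org/maven2/") ++ resolvers.value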
Add a mapPartitions example to GenerateScalingData
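A rough sketch of the kind of example this could be (the function and types here are hypothetical, not GenerateScalingData's actual API): mapPartitions lets the expensive setup happen once per partition instead of once per record.

import org.apache.spark.rdd.RDD
import scala.util.Random

// Hypothetical helper: generate one double per input record, reusing a single
// Random instance per partition.
def generateDoubles(input: RDD[Long]): RDD[Double] = {
  input.mapPartitions { iter =>
    val rand = new Random()          // expensive setup, done once per partition
    iter.map(_ => rand.nextDouble()) // reused for every element in the partition
  }
}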
Our test coverage is super low; try to improve it.
We should port at least some of the examples to the Java API.
Add a simple benchmark to illustrate some of the differences between Spark Core and DataFrames/Datasets.
One benchmark for the sum of fuzzyness, the other for the max panda id.
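A rough sketch of the benchmark idea (the Panda case class is a stand-in, not the book's actual schema): run the same two aggregations through the RDD API and through Datasets, then wrap each side in a timer for the actual comparison.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum}

case class Panda(id: Long, fuzzyness: Double)

def compareApis(spark: SparkSession, pandas: Seq[Panda]): Unit = {
  import spark.implicits._
  val rdd = spark.sparkContext.parallelize(pandas)
  val ds = pandas.toDS()

  // Spark Core: sum of fuzzyness and max panda id.
  val coreSum = rdd.map(_.fuzzyness).sum()
  val coreMax = rdd.map(_.id).max()

  // Dataset/DataFrame: the same aggregations, expressed declaratively so the
  // optimizer can plan them.
  val row = ds.agg(sum("fuzzyness"), max("id")).head()
  println(s"core: $coreSum / $coreMax, dataset: ${row.get(0)} / ${row.get(1)}")
}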
It could be my own unfamiliarity with Scala, but I'm currently unable to build this project as-is using sbt. I'm getting the stack trace pasted below from IntelliJ (I get something similar when running 'sbt compile' from the CLI).
SBT 'high-performance-spark-examples' project refresh failed
Error: Error while importing SBT project:
...
[info] Resolving org.scala-sbt#apply-macro;0.13.13 ...
[info] Resolving org.spire-math#json4s-support_2.10;0.6.0 ...
[info] Resolving org.codehaus.plexus#plexus-component-annotations;1.5.5 ...
[info] Resolving javax.annotation#jsr250-api;1.0 ...
[info] Resolving com.thoughtworks.paranamer#paranamer;2.6 ...
[info] Resolving com.typesafe#config;1.2.0 ...
[info] Resolving org.scala-sbt#test-agent;0.13.13 ...
[info] Resolving org.scala-sbt#classfile;0.13.13 ...
[info] Resolving org.scala-sbt#completion;0.13.13 ...
[info] Resolving org.scala-sbt#test-interface;1.0 ...
[info] Resolving com.jcraft#jsch;0.1.50 ...
[info] Resolving org.scala-lang#scala-compiler;2.10.6 ...
[info] Resolving org.scala-sbt#interface;0.13.13 ...
[info] Resolving javax.inject#javax.inject;1 ...
[info] Resolving org.scala-sbt#logging;0.13.13 ...
[trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the full output.
[trace] Stack trace suppressed: run 'last *:update' for the full output.
[error] (*:ssExtractDependencies) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] (*:update) sbt.ResolveException: download failed: org.mortbay.jetty#jetty;6.1.26!jetty.zip
[error] Total time: 19 s, completed Jul 10, 2017 10:47:35 AM
See complete log in file:/Users/adam/Library/Logs/IdeaIC2017.1/sbt.last.log
val r = SecondarySort.groupByKeyAndSortBySecondaryKey(data, 3)
val rSorted = r.collect().sortWith(lt = (a, b) => a._1.toDouble > b._1.toDouble)
assert(r.collect().zipWithIndex.forall {
  case ((key, list), index) => rSorted(index)._1.equals(key)
})
Actually, r is not ordered, so it isn't correct to compare r element-by-element with rSorted.
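Since the global order of r across partitions isn't guaranteed, one possible replacement check (a sketch; it assumes the intended guarantee is that keys are sorted within each partition, which depends on how groupByKeyAndSortBySecondaryKey is implemented) is:

// Check the key ordering per partition instead of across the whole RDD.
val partitionsOrdered = r.mapPartitions { iter =>
  val keys = iter.map(_._1.toDouble).toList
  Iterator(keys == keys.sorted) // use keys.sorted.reverse for descending order
}.collect()
assert(partitionsOrdered.forall(identity))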
Example 3-4
It's difficult to work in IntelliJ with sbt 0.13 (it takes a long time to "dump project structure from sbt"), and plugins like sbt-spark-package don't seem to work with sbt 1.x.
Where does the head :: rest pattern (matched against the list value) come from in the following function?
def groupSorted[K, S, V](it: Iterator[((K, S), V)]): Iterator[(K, List[(S, V)])] = {
  val res = List[(K, ArrayBuffer[(S, V)])]()
  it.foldLeft(res)(
    (list, next) => list match {
      case Nil =>
        val ((firstKey, secondKey), value) = next
        List((firstKey, ArrayBuffer((secondKey, value))))
      case head :: rest =>
        val (curKey, valueBuf) = head
        val ((firstKey, secondKey), value) = next
        if (!firstKey.equals(curKey)) {
          (firstKey, ArrayBuffer((secondKey, value))) :: list
        } else {
          valueBuf.append((secondKey, value))
          list
        }
    }
  ).map { case (key, buf) => (key, buf.toList) }.iterator
}
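For what it's worth, head :: rest is just a list pattern on the accumulator: foldLeft starts with the empty res, so the first element hits the Nil case, and every later element matches head :: rest, where head is the most recently started group. A tiny, hypothetical trace (assuming the input iterator is already sorted by the primary key, which the function relies on):

import scala.collection.mutable.ArrayBuffer

// Primary keys "a" then "b", already sorted.
val input = Iterator((("a", 1), "x"), (("a", 2), "y"), (("b", 1), "z"))
val grouped = groupSorted(input).toList
// New groups are prepended, so keys come out in reverse order of first
// appearance: List((b, List((1, z))), (a, List((1, x), (2, y))))
println(grouped)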
It would be great if you could show how push-down filters behave (if they work at all) when a filter is applied on the rowKey of an HBase-backed Hive table.
There seems to be a problem with https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/src/main/scala/com/high-performance-spark-examples/ml/CustomPipeline.scala#L125. For example, if we try testing it thus
val indexer = new SimpleIndexer()
indexer.setInputCol("inputColumn")
indexer.setOutputCol("categoryIndex")
val model = indexer.fit(ds)
val predicted = model.transform(ds)
the indexer has the inputCol set, but the model (that is, the object returned by indexer.fit) does not. So when we try model.transform, it complains that it "Failed to find a default value for inputCol".
Following https://stackoverflow.com/questions/40847625/spark-custom-estimator-access-to-paramt, @conorbmurphy and I were able to "solve" the problem by hard-coding the column names within the class with
setDefault(inputCol, "inputColumn")
setDefault(outputCol, "categoryIndex")
but that's hardly the right solution. What should happen is that the paramMap should be copied into the SimpleIndexerModel when it's generated in the fit method, but we can't figure out how to do that since paramMap is protected.
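For reference, the pattern Spark's own estimators use is to call copyValues on the model inside fit; copyValues is inherited from Params, so it is usable there even though paramMap itself is protected. A rough sketch (the SimpleIndexerModel constructor arguments are placeholders, not necessarily the class's real signature):

override def fit(dataset: Dataset[_]): SimpleIndexerModel = {
  // ... existing fitting logic that produces the fitted labels ...
  val model = new SimpleIndexerModel(uid, labels)
  // Copy this estimator's param map (inputCol, outputCol, ...) onto the model
  // so that model.transform can resolve them.
  copyValues(model.setParent(this))
}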
Add dataset join examples.
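A minimal sketch of what one such example might cover (the case classes are placeholders, not the book's schema): joinWith keeps both sides typed instead of flattening the result into a DataFrame.

import org.apache.spark.sql.SparkSession

case class PandaInfo(id: Long, zip: String)
case class PandaScore(id: Long, score: Double)

def joinExample(spark: SparkSession): Unit = {
  import spark.implicits._
  val info = Seq(PandaInfo(1L, "94110"), PandaInfo(2L, "10001")).toDS()
  val scores = Seq(PandaScore(1L, 0.9)).toDS()
  // Typed join: the result is a Dataset[(PandaInfo, PandaScore)].
  val joined = info.joinWith(scores, info("id") === scores("id"), "inner")
  joined.show()
}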
object CoPartitioningLessons corresponds to the book samples Example 6-8 and Example 6-9. The two functions use two different Partitioners to show co-location and co-partitioning.
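The core idea looks roughly like this (a sketch, not the repo's CoPartitioningLessons code): when both RDDs share the same Partitioner, the subsequent join is a narrow dependency and neither side is reshuffled.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def coPartitionedJoin(left: RDD[(Long, String)], right: RDD[(Long, Double)]) = {
  val partitioner = new HashPartitioner(8)
  // Both sides use the same partitioner, so matching keys land in matching partitions.
  val leftPart = left.partitionBy(partitioner)
  val rightPart = right.partitionBy(partitioner)
  // With matching partitioners, this join does not shuffle either input.
  leftPart.join(rightPart)
}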
We should port some of the examples to Python
Once travis-ci/apt-source-safelist#287 is resolved (or we can otherwise access R >= 3 in our Travis env), enable the SparkR tests.