
learning-spark's Introduction

Examples for Learning Spark

Examples for the Learning Spark book. These examples require a number of libraries and therefore have long build files. We have also added a standalone example with minimal dependencies and a small build file in the mini-complete-example directory.

These examples have been updated to run against Spark 1.3, so they may differ slightly from the versions in your copy of "Learning Spark".

Requirements

  • JDK 1.7 or higher
  • Scala 2.10.3 (see scala-lang.org)
  • Spark 1.3
  • Protobuf compiler (on Debian you can install it with sudo apt-get install protobuf-compiler)
  • R and the CRAN package Imap (required for the ChapterSixExample)
  • urllib3 (required for the Python examples)

Python examples

From the Spark directory, run ./bin/pyspark ./src/python/[example]

Spark Submit

You can also create an assembly JAR with all of the dependencies for running either the Java or Scala versions of the code, and run the job with the spark-submit script:

./sbt/sbt assembly   # or: mvn package
cd $SPARK_HOME
./bin/spark-submit --class com.oreilly.learningsparkexamples.[lang].[example] ../learning-spark-examples/target/scala-2.10/learning-spark-examples-assembly-0.0.1.jar

Learning Spark

learning-spark's People

Contributors

aalkilani, cesararevalo, holdenk, huydx, mrt, tabdulradi, vidaha


learning-spark's Issues

Difference between running Spark as local[*] vs yarn-client vs yarn-cluster in terms of performance

Kindly consider this as an inquiry if not an issue.

Hi,
I am evaluating Spark for use at my workplace.

We have an existing Hortonworks HDP 2.3 install.

I am trying to work out whether I should use local or client or cluster to submit a job in Spark.

Consider that I am running my job as:
sudo -u hdfs spark-submit --class "org.xyz.Spark_ES_Java_V4" --master "local[*]" target/xyz-1.1-jar-with-dependencies.jar 192.168.0.185 55555 > prashant.txt

With this configuration, the task completes in 14 seconds.

When I run the same job like this:
sudo -u hdfs spark-submit --class "org.xyz.Spark_ES_Java_V4" --master "yarn-client" target/xyz-1.1-jar-with-dependencies.jar 192.168.0.185 55555 > prashant.txt

it takes 16 seconds.

And this one
sudo -u hdfs spark-submit --class "org.xyz.Spark_ES_Java_V4" --master "yarn-cluster" target/xyz-1.1-jar-with-dependencies.jar 192.168.0.185 55555 > prashant.txt

takes 18 seconds.

In the first case I am running it locally, meaning it runs on one machine and takes less time, whereas in the later cases I am submitting the job to a cluster with 4 nodes.

So can anyone tell me what the point of running the same job on the cluster is, given that I see a performance degradation with the cluster? Or is there a way I can improve the performance on the cluster?

I would love to hear from someone regarding this soon.

~Prashant

Example 3-13 TypeError

In the process of putting together an IPython Notebook with convenient worked examples from Learning Spark, I found a simple Python semantic error.

Example 3-15 says:

print "Input had " + badLinesRDD.count() + " concerning lines"

which results in

TypeError                                 Traceback (most recent call last)
<ipython-input-10-078b22c97d4b> in <module>()
 ----> 1 print "Input had " + badLinesRDD.count() + " concerning lines"
  2 print "Here are 10 examples:"
  3 for line in badLinesRDD.take(10):
  4     print line

TypeError: cannot concatenate 'str' and 'int' objects

It should say something more like this:

print "Input had %d worrisome lines" % (badLinesRDD.count())

I made a gist of an ipython notebook showing the problem and the fix, with simple worked examples, which you can see (and download) here:

http://nbviewer.ipython.org/gist/nealmcb/b6d989a83adddcdd459f

I suggest including such notebooks in future editions.

sbt assembly - FileNotFoundException (Interrupted system call)

When trying to run sbt assembly, it gets partway through and then gives a ton of FileNotFoundExceptions with (Interrupted system call). This output shows the first couple:

...
[info] Including: chill-java-0.5.0.jar
[info] Including: spark-cassandra-connector-java_2.10-1.0.0-rc5.jar
[info] Including: scopt_2.10-3.2.0.jar
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.io.FileNotFoundException: /Users/kcpaul/projects/learning-spark/target/streams/$global/assemblyOption/$global/streams/assembly/be9fe88db32697a9ad164c2f7c93714b1afaa465_2dc57d2aeaf2dee095ca4149dbe0163f2d66a845.jarName (Interrupted system call)
java.io.FileInputStream.open(Native Method)
java.io.FileInputStream.<init>(FileInputStream.java:146)
sbt.Using$$anonfun$fileInputStream$1.apply(Using.scala:73)
sbt.Using$$anonfun$fileInputStream$1.apply(Using.scala:73)
sbt.Using$$anon$2.openImpl(Using.scala:65)
sbt.OpenFile$class.open(Using.scala:43)
sbt.Using$$anon$2.open(Using.scala:64)
sbt.Using$$anon$2.open(Using.scala:64)
sbt.Using.apply(Using.scala:23)
sbt.IO$.transfer(IO.scala:252)
.
.
.

java.io.FileNotFoundException: /Users/kcpaul/.ivy2/cache/org.json4s/json4s-core_2.10/jars/json4s-core_2.10-3.2.10.jar (Interrupted system call)
java.io.FileInputStream.open(Native Method)
java.io.FileInputStream.<init>(FileInputStream.java:146)
sbt.Using$$anonfun$fileInputStream$1.apply(Using.scala:73)
sbt.Using$$anonfun$fileInputStream$1.apply(Using.scala:73)
sbt.Using$$anon$2.openImpl(Using.scala:65)
sbt.OpenFile$class.open(Using.scala:43)
sbt.Using$$anon$2.open(Using.scala:64)
sbt.Using$$anon$2.open(Using.scala:64)
sbt.Using.apply(Using.scala:23)
sbtassembly.AssemblyUtils$.unzip(AssemblyUtils.scala:37)
.
.
.

java.io.FileNotFoundException: /Users/kcpaul/.sbt/boot/scala-2.10.4/lib/scala-reflect.jar (Interrupted system call)
java.io.FileInputStream.open(Native Method)
java.io.FileInputStream.<init>(FileInputStream.java:146)
sbt.Using$$anonfun$fileInputStream$1.apply(Using.scala:73)
sbt.Using$$anonfun$fileInputStream$1.apply(Using.scala:73)
sbt.Using$$anon$2.openImpl(Using.scala:65)
sbt.OpenFile$class.open(Using.scala:43)
sbt.Using$$anon$2.open(Using.scala:64)
sbt.Using$$anon$2.open(Using.scala:64)
sbt.Using.apply(Using.scala:23)
sbtassembly.AssemblyUtils$.unzip(AssemblyUtils.scala:37)
.
.
.

I found a similar report which says "this is caused by missing particular dependencies that may be excluded directly or indirectly in a build file. Just fix this and it is resolved."

Example 4-12 - PerKeyAvg for Python Incorrect

In the example, the map method is shown taking a lambda with two parameters (key and xy), but the Python version of Spark only has a map method that expects a lambda with a single parameter (the key-value pair).

So instead of the following

r = sumCount.map(lambda key, xy: (key, xy[0]/xy[1])).collectAsMap()

We should use

r = sumCount.map(lambda kvp: (kvp[0], kvp[1][0] / kvp[1][1])).collectAsMap()

Manually adding scala.binary.version to Maven pom file

I had an issue while building the project due to the missing property scala.binary.version, which is used in:

<dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.binary.version}</artifactId>
      <version>2.2.1</version>
</dependency>

The property doesn't appear to be set in the pom file. I fixed the issue by adding my Scala version. Should this be added by default? I didn't see any documentation saying that this needs to be set, or am I doing it wrong?

<properties>
    <java.version>1.7</java.version>
    <scala.binary.version>2.10</scala.binary.version>
 </properties>

Maven dependencies incorrect

I had issues building the learning-spark-master Maven project due to missing artifacts. I fixed this by modifying the following dependencies (just adding "_2.10" to the end) to match what's in the Central Maven repository:

    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.10</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency> <!-- Cassandra -->
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector_2.10</artifactId>
      <version>1.0.0-rc5</version>
    </dependency>
    <dependency> <!-- Cassandra -->
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector-java_2.10</artifactId>
      <version>1.0.0-rc5</version>
    </dependency>

Example 4-17 incorrect

Two issues:

  1. The initialization syntax is incorrect: { (Store(...) ...}
  2. No code is provided for the Store class. Creating a case class Store(name: String) doesn't work properly. The joins don't work with Store objects as keys. I suspect one needs to override the hashCode method.

Any suggestions?
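For what it's worth, a Scala case class already derives equals and hashCode from its constructor fields, so in principle it can serve as a join key. A minimal sketch (assuming an existing SparkContext sc and a single-field Store, which may not match the book's definition):

    case class Store(name: String)

    // Hypothetical pair RDDs keyed by Store, just to exercise the join.
    val sales = sc.parallelize(List((Store("panda mart"), 10), (Store("happy bakery"), 20)))
    val addresses = sc.parallelize(List((Store("panda mart"), "1 Bamboo Way")))

    // join matches keys via equals/hashCode, which the case class derives from its fields.
    val joined = sales.join(addresses)
    joined.collect().foreach(println)  // expect: (Store(panda mart),(10,1 Bamboo Way))

If joins still misbehave with keys like this, the cause is more likely in how the Store values are constructed than in hashCode itself.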

Task not serializable

Using a Jupyter notebook with Apache Toree as the kernel, the following

import play.api.libs.json._

case class Person(name: String, lovesPandas: Boolean)
implicit val personFormat = Json.format[Person]

val text = """{"name":"Sparky The Bear", "lovesPandas":true}"""

val jsonParse = Json.parse(text)
val result = Json.fromJson[Person](jsonParse)
result.get

works, while

import org.apache.spark._
import play.api.libs.json._
import play.api.libs.functional.syntax._

case class Person(name: String, lovesPandas: Boolean)
implicit val personReads = Json.format[Person]

val text = """{"name":"Sparky The Bear", "lovesPandas":true}"""

val input = sc.parallelize(List(text))
val parsed = input.map(Json.parse(_))
val result = parsed.flatMap(record => {    
    personReads.reads(record).asOpt
})
result.filter(_.lovesPandas).map(Json.toJson(_)).saveAsTextFile("files/out/pandainfo.json")

fails with

Name: org.apache.spark.SparkException
Message: Task not serializable
StackTrace: org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
[...]

It seems there is a more general problem here...?
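One workaround that often helps in REPL/notebook environments (offered as a sketch under the same assumptions as above, not a confirmed fix for Toree) is to avoid capturing a driver-side implicit val in the closure and build the JSON machinery inside the task instead:

    import play.api.libs.json._

    case class Person(name: String, lovesPandas: Boolean)

    val text = """{"name":"Sparky The Bear", "lovesPandas":true}"""
    val input = sc.parallelize(List(text))
    val parsed = input.map(Json.parse(_))

    // Build the Reads inside the task so the closure does not need to capture
    // a notebook-defined implicit val, a frequent cause of "Task not serializable".
    val result = parsed.flatMap { record =>
      val personReads = Json.reads[Person]
      personReads.reads(record).asOpt
    }

    // Serialize back to JSON strings without capturing driver-side implicits.
    result.filter(_.lovesPandas)
      .map(p => Json.stringify(Json.obj("name" -> p.name, "lovesPandas" -> p.lovesPandas)))
      .saveAsTextFile("files/out/pandainfo.json")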

sbt assembly error

I downloaded the usb.zip and am following the O'Reilly video course Introduction to Apache Spark. There are some places where the presentation is out of sync with the usb.zip contents, but this is where it really hurts: I can't seem to compile the examples with the instructions provided in the video, that is, running the sbt.cmd file with the parameter assembly. Is there a fix for this? This is the error I get:

D:\Spark\usb\spark-training\sbt>sbt assembly
[info] Set current project to sbt (in build file:/D:/Spark/usb/spark-training/sbt/)
[error] Not a valid command: assembly
[error] Not a valid project ID: assembly
[error] Expected ':' (if selecting a configuration)
[error] Not a valid key: assembly
[error] assembly
[error] ^

D:\Spark\usb\spark-training\sbt>

'sbt package' fails due to schema proto files

sbt package

Compiling schema /databricks/learning-spark-master/src/main/protobuf/places.proto
java.lang.RuntimeException: error occured while compiling protobuf files: Cannot run program "protoc": error=2, No such file or directory

Caused by: java.io.IOException: Cannot run program "protoc": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)

Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)

error error occured while compiling protobuf files: Cannot run program "protoc": error=2, No such file or directory

Doesn't even run...

I would like to run BasicLoadJson.java, but it does not work. I cloned the repo, ran a Maven clean build, and tried to run it, but I get this error:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/FlatMapFunction
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.api.java.function.FlatMapFunction
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

Why is that? What can I do?

Spark RDD pipe example got "/usr/bin/env: RScript: No such file or directory" error

I ran PipeExample with master = local using the original code, and I got:
java.io.IOException: Cannot run program "/tmp/spark-8db4969b-f76c-4377-b40a-acb4c6c6a7bb/userFiles-988a7579-afc7-43be-ad86-dcd98861f269/home/dev/projects/samples/learning-spark-master/src/R/finddistance.R": error=2, No such file or directory

I think this is a mistake. We should give the file name instead of the full file path to SparkFiles.get; Spark will locate the file in the designated tmp folder for us. Therefore, I rewrote the code following ChapterSixExample as below, and I still got errors:
16/01/01 17:16:17 INFO Utils: /home/dev/projects/samples/spark-tutorial/src/R/finddistance.R has been previously copied to /tmp/spark-8a56a6c2-32cc-4ec4-ace4-bfd62c796881/userFiles-6c9e6b58-3d70-49f4-bb6b-7f55178adabc/finddistance.R
/usr/bin/env: RScript: No such file or directory
There should not be any permission issue with the tmp folder; however, I cannot find that specific folder. Any idea how I can resolve this issue and get either PipeExample or ChapterSixExample to work? Thanks.

val pwd = System.getProperty("user.dir")
val distScript = pwd + "/src/R/finddistance.R"
val distScriptName = "finddistance.R"
sc.addFile(distScript)  // ship the script to every executor
val piped = rdd.pipe(Seq(SparkFiles.get(distScriptName)))  // resolve the executor-local copy by name

build failure due to protocol buffer

Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at sbt.SimpleProcessBuilder.run(ProcessImpl.scala:349)
at sbt.AbstractProcessBuilder.run(ProcessImpl.scala:128)
at sbt.AbstractProcessBuilder$$anonfun$runBuffered$1.apply(ProcessImpl.scala:159)
at sbt.AbstractProcessBuilder$$anonfun$runBuffered$1.apply(ProcessImpl.scala:159)
at sbt.BufferedLogger.buffer(BufferedLogger.scala:25)
at sbt.AbstractProcessBuilder.runBuffered(ProcessImpl.scala:159)
at sbt.AbstractProcessBuilder.$bang(ProcessImpl.scala:156)
at sbtprotobuf.ProtobufPlugin$$anonfun$protobufSettings$6$$anonfun$apply$1.apply(ProtobufPlugin.scala:27)
at sbtprotobuf.ProtobufPlugin$$anonfun$protobufSettings$6$$anonfun$apply$1.apply(ProtobufPlugin.scala:27)
at sbtprotobuf.ProtobufPlugin$.executeProtoc(ProtobufPlugin.scala:66)
at sbtprotobuf.ProtobufPlugin$.sbtprotobuf$ProtobufPlugin$$compile(ProtobufPlugin.scala:81)
at sbtprotobuf.ProtobufPlugin$$anonfun$sourceGeneratorTask$1$$anonfun$5.apply(ProtobufPlugin.scala:107)
at sbtprotobuf.ProtobufPlugin$$anonfun$sourceGeneratorTask$1$$anonfun$5.apply(ProtobufPlugin.scala:106)
at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:235)
at sbt.FileFunction$$anonfun$cached$1.apply(Tracked.scala:235)
at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:249)
at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3$$anonfun$apply$4.apply(Tracked.scala:245)
at sbt.Difference.apply(Tracked.scala:224)
at sbt.Difference.apply(Tracked.scala:206)
at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:245)
at sbt.FileFunction$$anonfun$cached$2$$anonfun$apply$3.apply(Tracked.scala:244)
at sbt.Difference.apply(Tracked.scala:224)
at sbt.Difference.apply(Tracked.scala:200)
at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:244)
at sbt.FileFunction$$anonfun$cached$2.apply(Tracked.scala:242)
at sbtprotobuf.ProtobufPlugin$$anonfun$sourceGeneratorTask$1.apply(ProtobufPlugin.scala:109)
at sbtprotobuf.ProtobufPlugin$$anonfun$sourceGeneratorTask$1.apply(ProtobufPlugin.scala:104)
at scala.Function7$$anonfun$tupled$1.apply(Function7.scala:35)
at scala.Function7$$anonfun$tupled$1.apply(Function7.scala:34)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
error error occured while compiling protobuf files: Cannot run program "protoc": error=2, No such file or directory

java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

package com.oreilly.learningsparkexamples.java;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaSparkTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("My App");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // JavaRDD rdd = sc.textFile("D:\工作\20171227\newdesc.xml");
        // JavaRDD lineee = rdd.filter(line -> line.contains("desc"));
        // System.out.println(lineee.count());
    }
}

I cloned the repository directly and all of the dependency packages were downloaded, but it keeps failing with an error that the jar cannot be found:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at com.oreilly.learningsparkexamples.java.JavaSparkTest.main(JavaSparkTest.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more

The IDE I am using is IntelliJ IDEA.

sbt assembly error

After running "sbt assembly", I got the following error:

[error] learning-spark/src/main/scala/com/oreilly/learningsparkexamples/scala/BasicParseJsonWithJackson.scala:47: not found: type ioRecord
[error] Some(mapper.readValue(record, classOf[ioRecord]))
[error] ^
[error] learning-spark/src/main/scala/com/oreilly/learningsparkexamples/scala/BasicParseJsonWithJackson.scala:53: value lovesPandas is not a member of Nothing
[error] result.filter(_.lovesPandas).map(mapper.writeValueAsString(_))
[error] ^
[error] two errors found
error Compilation failed
[error] Total time: 24 s, completed Jun 17, 2015 10:00:31 PM

Removing BasicParseJsonWithJackson.scala and recompiling would result in other errors related to protocol buffer. Has anyone successfully built the project? How did you do it?

Thanks!
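For reference, the pattern that example is aiming for with Jackson looks roughly like this (a sketch with an assumed Person case class, not the repository's exact source):

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    case class Person(name: String, lovesPandas: Boolean)

    // Register the Scala module so Jackson can bind JSON to case classes.
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)

    val record = """{"name":"Sparky The Bear", "lovesPandas":true}"""
    val person = mapper.readValue(record, classOf[Person])  // throws on malformed input
    if (person.lovesPandas) println(mapper.writeValueAsString(person))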

sample 2.7

This code does not work:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)

How to read a flattened-type field from Elasticsearch using the PySpark connector?

After reading an index from Elasticsearch using PySpark in Databricks, all fields appeared except one field, which is of type 'flattened'. Is there any option that I need to include during the read? The following is my snippet for reading the index:

spark.read.format("org.elasticsearch.spark.sql")\
      .option("es.nodes", ",".join(db['nodes']))\
      .option("es.mapping.date.rich", "false")\
      .option("es.net.http.auth.user", 'abc')\
      .option("es.net.http.auth.pass", '123abc')\
      .load('indexname')

elasticsearch: 7.5.1
spark: 2.4.3

java.lang.ClassNotFoundException: com.oreilly.learningsparkexamples.mini.scala.WordCount

I ran sbt clean package successfully, and then executed
spark-submit --class com.oreilly.learningsparkexamples.mini.scala.WordCount ./target/.. ./README.md ./wordcount
which throws a ClassNotFoundException.
build.sbt

version := "0.0.1"   
scalaVersion := "2.10.6"   
// additional libraries   
libraryDependencies ++= Seq(   
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"   
)    

spark version is 2.2.0

scala version is   
`Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL` 
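One thing worth double-checking (a guess, since the jar path above is truncated): sbt derives the jar location from the project name, so an explicit name setting makes the artifact path predictable. A sketch of a complete build.sbt, where the name value is an assumption rather than the repository's actual setting:

    // build.sbt (sketch; the name below is assumed, not taken from the repo)
    name := "learning-spark-mini-example"

    version := "0.0.1"

    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
    )

    // sbt package then writes target/scala-2.10/learning-spark-mini-example_2.10-0.0.1.jar,
    // which is the jar path to pass to spark-submit.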

FTP example does not work...

I keep getting:
org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:298)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:495)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:537)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:586)
at org.apache.commons.net.ftp.FTP.quit(FTP.java:794)
at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:788)
at org.apache.hadoop.fs.ftp.FTPFileSystem.disconnect(FTPFileSystem.java:151)
at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:395)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)

Please note I have also tried connecting to the SEC FTP (publicly available using anonymous:youremailaddress) and I am still receiving issues.
Could you please assist in guiding me on how to open an FTP file using Spark?
My email address is [email protected]

kind regards
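As a general pointer, Spark reads ftp:// URIs through Hadoop's FTPFileSystem, so a minimal sketch looks like the following (hostname, credentials, and path are placeholders, and the server must accept the connection mode Hadoop's FTP client uses):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[*]").setAppName("FtpReadSketch")
    val sc = new SparkContext(conf)

    // Placeholder URI; replace user, password, host, and path with real values.
    val lines = sc.textFile("ftp://user:password@ftp.example.com/path/to/file.txt")
    println(lines.count())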
