tribbloid / ispark Goto Github PK

View Code? Open in Web Editor NEW

105.0 17.0 29.0 154 KB

An Apache Spark-shell backend for IPython

License: Apache License 2.0

Scala 99.91% Shell 0.09%

ispark's Introduction

ISpark

ISpark is an Apache Spark-shell backend for IPython.

ISpark is ported from IScala, all credit goes to Mateusz Paprocki

Requirements

IPython 2.0+
Java JRE 1.7+

How it works

ISpark is a standard Spark Application that when submitted, its driver will maintain a three-way connection between IPython UI server and Spark cluster.

Powered By


Apache Spark	Apache Maven
Yourkit Java Profiler	Jupyter Notebook

Demo

Click me for a quick impression.

This environment is deployed on a Spark cluster with 4+ cores. It comes with no uptime guarantee and may not be accessible during maintenance.

Usage

ISpark only supports Native (Spark-shell) environment, support for Mahout DRM will be added upon request.

ISpark needs to be compiled and packaged into an uber jar by Maven before being submitted and deployed:

./mvn-install.sh
...
Building jar: ${PROJECT_DIR}/core/target/ispark-core-${PROJECT_VERSION}.jar
...

after which you can define a Spark profile for IPython by running:

$ ipython profile create spark

Then adding the following line into ~/.ipython/profile_spark/ipython_config.py:

import os
c = get_config()

SPARK_HOME = os.environ['SPARK_HOME']
# the above line can be replaced with: SPARK_HOME = '${INSERT_INSTALLATION_DIR_OF_SPARK}'
MASTER = '${INSERT_YOUR_SPARK_MASTER_URL}'

c.KernelManager.kernel_cmd = [SPARK_HOME+"/bin/spark-submit",
 "--master", MASTER,
 "--class", "org.tribbloid.ispark.Main",
 "--executor-memory", "2G",
#(only enable this line if you have extra jars) "--jars", "${FULL_PATHS_OF_EXTRA_JARS}",
 "${FULL_PATH_OF_MAIN_JAR}",
 "--profile", "{connection_file}",
 "--parent"]

c.NotebookApp.ip = '*' # only add this line if you want IPython-notebook being open to the public
c.NotebookApp.open_browser = False # only add this line if you want to suppress opening a browser after IPython-notebook initialization
c.NotebookApp.port = 8888

Congratulation! Now you can initialize ISpark CLI or ISpark-notebook by running:

ipython console --profile spark OR ipython notebook --profile spark

(Support for the data collection/enrichment engine SpookyStuff has been moved to an independent project: https://github.com/tribbloid/ISpooky.git)

Example

In [1]: sc
Out[1]: org.apache.spark.SparkContext@2cd972df

In [2]: sc.parallelize(1 to 10).map(v => v*v).collect.foreach(println(_))
Out[2]:
1
4
9
16
25
36
49
64
81
100

Magics

ISpark supports magic commands similarly to IPython, but the set of magics is different to match the specifics of Scala and JVM. Magic commands consist of percent sign % followed by an identifier and optional input to a magic. Magic command's syntax may resemble valid Scala, but every magic implements its own domain specific parser.

Type information

To infer the type of an expression use %type expr. This doesn't require evaluation of expr, only compilation up to typer phase. You can also get compiler's internal type trees with %type -v or %type --verbose.

In [1]: %type 1
Int

In [2]: %type -v 1
TypeRef(TypeSymbol(final abstract class Int extends AnyVal))

In [3]: val x = "" + 1
Out[3]: 1

In [4]: %type x
String

In [5]: %type List(1, 2, 3)
List[Int]

In [6]: %type List("x" -> 1, "y" -> 2, "z" -> 3)
List[(String, Int)]

In [7]: %type List("x" -> 1, "y" -> 2, "z" -> 3.0)
List[(String, AnyVal)]

In [8]: %type sc
SparkContext

Warning

Support for sbt-based library/dependency management has been removed due to its incompatibility with spark deployment requirement. if sbt is allowed to download new dependencies, using them in any distributed closure may compile but will throw ClassDefNotFoundErrors in runtime because they won't be submitted to Spark master. Users are encouraged to attach their jars using the "--jars" parameter of spark-submit.

License

Published under ASF License, see LICENSE.

ispark's People

Contributors

Stargazers

Watchers

ispark's Issues

ClassNotFoundException with class, defined in notebook

Hi, and thank you for your work
spark 1.3.1, yarn-client
to reproduce

case class Test(x: Int)
sc.parallelize((0 to 99).map{Test(_)}).map{(_,1)}.countByKey()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 26.0 failed 4 times, most recent failure: Lost task 3.3 in stage 26.0 (TID 165): java.lang.ClassNotFoundException: $iwC$$iwC$Test
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

How to render results as HTML elements

In python, we can define repr_html to print out the data as HTML elements, can I do this in ISpark?

This is very important for visualization.

Jianshi

Autocomplete is not working

When I press tab key in the notebook the following exception appears in the console:

Exception in thread "RequestsEventLoop" java.lang.RuntimeException: JSON deserialization error: JsResultException(errors:List((/line,List(ValidationError(error.path.missing,WrappedArray()))), (/text,List(ValidationError(error.path.missing,WrappedArray())))))
    at scala.sys.package$.error(package.scala:27)
    at scala.Predef$.error(Predef.scala:142)
    at org.tribbloid.ispark.Communication.recv(Communication.scala:77)
    at org.tribbloid.ispark.Main$EventLoop.run(Main.scala:165)

Question about macro in JsonImpl

Hello, I'm using the ispark in my project but I now have a problem with macro.

I use bazel to build my project, which uses scala 2.11, and it cannot comple macro in scala 2.10.

So, can i just modify the macro in JsMacroImpl.scala to function (see the following code)? I wonder if it will cause any problems?

package org.tribbloid.ispark.macros

import play.api.libs.json.{Format, JsMacroImpl => PlayMacroImpl, Reads, Writes}

import scala.language.experimental.macros
import scala.reflect.macros.Context

trait JsonImpl {
    def reads[A]:  Reads[A]  = macro PlayMacroImpl.readsImpl[A]
    def writes[A]: Writes[A] = macro JsMacroImpl.sealedWritesImpl[A]
    def format[A]: Format[A] = macro PlayMacroImpl.formatImpl[A]
}

Compare and contrast to spark-notebook and Zeppelin

There are a number of notebooks out there that support Spark. What is the difference between this one and, for instance, spark-notebook and Zeppelin? I think this should be a section in the README.

Illegal character in path in ISpark jar file

After I used maven twice (first time failed) to create the uber jar file, and setup IScala with spark backend, ipython notebook kernel died. I would like someone to help me find the illegal characters in the source code and fix it. The main error messages are the following:

Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 60: /home/username/ISpark/core/target/ispark-core-0.2.0-SNAPSHOT.jar}
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.checkChars(URI.java:3002)
at java.net.URI$Parser.parseHierarchical(URI.java:3086)
at java.net.URI$Parser.parse(URI.java:3044)
at java.net.URI.(URI.java:595)
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1506)
at org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1530)
at org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1530)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.util.Utils$.resolveURIs(Utils.scala:1530)
at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:307)
at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:220)
at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:75)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

IPython3 error:'interp' is not a recognized option

Hi, I use ISpark in IPython3.0, the kernel.json is below:

{
    "display_name": "ISpark",
    "language": "scala",
    "argv": [
              "/opt/spark/spark-1.2.1/bin/spark-submit",
                "--master",
                "spark://x.xx.xxx.xxx:7077",
                "--total-executor-cores","3",
                "--class",
                "org.tribbloid.ispark.Main",
                "--executor-memory", "2G",
                "/pathto/ispark-core-assembly-0.2.0-SNAPSHOT.jar",
                "--profile",
                "{connection_file}",
                "--interp",
                "Spark",
                "--parent"
     ],
     "codemirror_mode": "scala"
}

while i use ISpark,it's logs:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" joptsimple.UnrecognizedOptionException: 'interp' is not a recognized option
    at joptsimple.OptionException.unrecognizedOption(OptionException.java:89)
    at joptsimple.OptionParser.handleLongOptionToken(OptionParser.java:429)
    at joptsimple.OptionParserState$2.handleArgument(OptionParserState.java:56)
    at joptsimple.OptionParser.parse(OptionParser.java:361)
    at org.tribbloid.ispark.Options.<init>(Options.scala:12)
    at org.tribbloid.ispark.Main$.main(Main.scala:192)
    at org.tribbloid.ispark.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Dose my kerner.json is wrong?

Shift + Tab operation not works in IPython 3.x

When you press shift tab when your console on a variable the following exception triggered and the kernel will freeze:

Exception in thread "RequestsEventLoop" java.lang.RuntimeException: JSON deserialization error: JsResultException(errors:List((/msg_type,List(ValidationError(Enumeration expected of type: class org.tribbloid.ispark.msg.MsgTypes$, but it does not appear to contain the value: inspect_request,WrappedArray())))))
at scala.sys.package$.error(package.scala:27)
at scala.Predef$.error(Predef.scala:142)
at org.tribbloid.ispark.Communication.recv(Communication.scala:77)
at org.tribbloid.ispark.Main$EventLoop.run(Main.scala:165)

Java Issue when run ipython

I installed ISpark and followed the documentation. However, when I run ipython, I got following error

Exception in thread "main" java.lang.NoClassDefFoundError: play/api/libs/json/Reads
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:319)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: play.api.libs.json.Reads
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 5 more

No such method error

Hi,

I'm getting the following error after upgrading to spark 1.3.0.

Any help is appreciated.

Exception in thread "main" java.lang.NoSuchMethodError: joptsimple.OptionSpecBuilder.forHelp()Ljoptsimple/AbstractOptionSpec;
at org.tribbloid.ispark.Options.(Options.scala:8)
at org.tribbloid.ispark.Main$.main(Main.scala:193)
at org.tribbloid.ispark.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Updgrade to Spark 1.3.0

Hi,
Is there any plan to upgrade ISpark to spark 1.2 or upcoming spark 1.3 version

port to jupyter-scala

It would be nice if this work could be ported to https://github.com/alexarchambault/jupyter-scala, so it can be loaded as a tool

Compiler accessed before init set up on Mesos

I'm trying to use ISpark on my Mesos cluster. When I start a notebook I get a compiler accessed before init set up warning than it starts to hang.

My spark-defaults.conf:

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master            spark://master:7077
# spark.eventLog.enabled  true
# spark.eventLog.dir      hdfs://namenode:8021/directory
# spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.master              mesos://zk://bdas-master-1:2181,bdas-master-2:2181,bdas-master-3:2181/mesos
spark.executor.uri        hdfs://bdas/opt/spark-1.0.2-bin-hadoop2.tgz
spark.executor.memory     2G

Console log

[krisz:~/workspace/bdas/spark-1.0.2-bin-hadoop2] [ml] 4m27s $ ipython notebook --profile spark --debug
2014-08-19 12:02:05.546 [NotebookApp] Config changed:
2014-08-19 12:02:05.546 [NotebookApp] {'Application': {'log_level': 10}, 'BaseIPythonApplication': {'profile': u'spark'}}
2014-08-19 12:02:05.546 [NotebookApp] IPYTHONDIR set to: /Users/krisz/.ipython
2014-08-19 12:02:05.547 [NotebookApp] Using existing profile dir: u'/Users/krisz/.ipython/profile_spark'
2014-08-19 12:02:05.547 [NotebookApp] Searching path [u'/Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2', u'/Users/krisz/.ipython/profile_spark'] for config files
2014-08-19 12:02:05.548 [NotebookApp] Attempting to load config file: ipython_config.py
2014-08-19 12:02:05.548 [NotebookApp] Loaded config file: /Users/krisz/.ipython/profile_spark/ipython_config.py
2014-08-19 12:02:05.549 [NotebookApp] Config changed:
2014-08-19 12:02:05.549 [NotebookApp] {'Application': {'log_level': 10}, 'BaseIPythonApplication': {'profile': u'spark'}, 'NotebookApp': {'ip': '*', 'open_browser': False}, 'KernelManager': {'kernel_cmd': ['/Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2/bin/spark-submit', '--class', 'org.tribbloid.ispark.Main', 'ispark-core-assembly-0.1.0-SNAPSHOT.jar', '--interp', 'Spark', '--parent']}}
2014-08-19 12:02:05.550 [NotebookApp] Attempting to load config file: ipython_notebook_config.py
2014-08-19 12:02:05.550 [NotebookApp] Loaded config file: /Users/krisz/.ipython/profile_spark/ipython_notebook_config.py
2014-08-19 12:02:05.553 [NotebookApp] Adding cluster profile 'default'
2014-08-19 12:02:05.554 [NotebookApp] Adding cluster profile 'julia'
2014-08-19 12:02:05.555 [NotebookApp] Adding cluster profile 'scala'
2014-08-19 12:02:05.555 [NotebookApp] Adding cluster profile 'spark'
2014-08-19 12:02:05.556 [NotebookApp] Adding cluster profile 'ssh'
2014-08-19 12:02:05.557 [NotebookApp] searching for local mathjax in [u'/Users/krisz/.ipython/nbextensions']
2014-08-19 12:02:05.557 [NotebookApp] searching for local mathjax in [u'/Users/krisz/.ipython/profile_spark/static', '/Users/krisz/.virtualenvs/ml/lib/python2.7/site-packages/IPython/html/static']
2014-08-19 12:02:05.557 [NotebookApp] Using MathJax from CDN: https://cdn.mathjax.org/mathjax/latest/MathJax.js
2014-08-19 12:02:05.569 [NotebookApp] CRITICAL | WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
2014-08-19 12:02:05.569 [NotebookApp] CRITICAL | WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
2014-08-19 12:02:05.570 [NotebookApp] Serving notebooks from local directory: /Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2
2014-08-19 12:02:05.570 [NotebookApp] 0 active kernels
2014-08-19 12:02:05.570 [NotebookApp] The IPython Notebook is running at: http://[all ip addresses on your system]:8888/
2014-08-19 12:02:05.570 [NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
DEBUG:tornado.access:200 GET /api/sessions?_=1408440601075 (::1) 2.70ms
DEBUG:tornado.access:200 GET /clusters?_=1408440601076 (::1) 8.15ms
DEBUG:tornado.access:200 GET /api/notebooks?_=1408440601077 (::1) 2.71ms
2014-08-19 12:02:13.657 [NotebookApp] Creating new notebook in /
2014-08-19 12:02:13.659 [NotebookApp] Autosaving notebook /Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2/Untitled6.ipynb
DEBUG:tornado.access:201 POST /api/notebooks (::1) 5.00ms
DEBUG:tornado.access:200 GET /notebooks/Untitled6.ipynb (::1) 71.35ms
DEBUG:tornado.access:200 GET /api/notebooks/Untitled6.ipynb?_=1408442533870 (::1) 1.38ms
2014-08-19 12:02:14.290 [NotebookApp] Connecting to: tcp://127.0.0.1:53244
2014-08-19 12:02:14.290 [NotebookApp] Kernel started: 20e97813-d4f0-4dab-b9d3-5f2af63448b5
2014-08-19 12:02:14.290 [NotebookApp] Kernel args: {'extra_arguments': [u'--debug', u"--IPKernelApp.parent_appname='ipython-notebook'", u'--profile-dir', u'/Users/krisz/.ipython/profile_spark'], 'cwd': u'/Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2'}
DEBUG:tornado.access:201 POST /api/sessions (::1) 21.15ms
DEBUG:tornado.access:200 GET /api/notebooks/Untitled6.ipynb/checkpoints (::1) 0.90ms
2014-08-19 12:02:14.360 [NotebookApp] Connecting to: tcp://127.0.0.1:53241
2014-08-19 12:02:14.367 [NotebookApp] Connecting to: tcp://127.0.0.1:53243
2014-08-19 12:02:14.367 [NotebookApp] Connecting to: tcp://127.0.0.1:53242
Spark assembly has been built with Hive, including Datanucleus jars on classpath
connect ipython with --existing /Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2/profile-58791.json
14/08/19 12:02:16 INFO spark.SecurityManager: Changing view acls to: krisz
14/08/19 12:02:16 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(krisz)
14/08/19 12:02:16 INFO spark.HttpServer: Starting HTTP Server
14/08/19 12:02:16 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/08/19 12:02:16 INFO server.AbstractConnector: Started [email protected]:53312
2014-08-19 12:02:17.289 [NotebookApp] Polling kernel...
14/08/19 12:02:18 INFO spark.SecurityManager: Changing view acls to: krisz
14/08/19 12:02:18 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(krisz)
14/08/19 12:02:18 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/08/19 12:02:18 INFO Remoting: Starting remoting
14/08/19 12:02:18 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:53363]
14/08/19 12:02:18 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:53363]
14/08/19 12:02:18 INFO spark.SparkEnv: Registering MapOutputTracker
14/08/19 12:02:18 INFO spark.SparkEnv: Registering BlockManagerMaster
14/08/19 12:02:18 INFO storage.DiskBlockManager: Created local directory at /var/folders/09/5khc_t2d4yv4v993r2j18qpm0000gn/T/spark-local-20140819120218-2e06
14/08/19 12:02:18 INFO storage.MemoryStore: MemoryStore started with capacity 294.9 MB.
14/08/19 12:02:18 INFO network.ConnectionManager: Bound socket to port 53366 with id = ConnectionManagerId(192.168.1.96,53366)
14/08/19 12:02:18 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/08/19 12:02:18 INFO storage.BlockManagerInfo: Registering block manager 192.168.1.96:53366 with 294.9 MB RAM
14/08/19 12:02:18 INFO storage.BlockManagerMaster: Registered BlockManager
14/08/19 12:02:18 INFO spark.HttpServer: Starting HTTP Server
14/08/19 12:02:18 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/08/19 12:02:18 INFO server.AbstractConnector: Started [email protected]:53367
14/08/19 12:02:18 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.1.96:53367
14/08/19 12:02:18 INFO spark.HttpFileServer: HTTP File server directory is /var/folders/09/5khc_t2d4yv4v993r2j18qpm0000gn/T/spark-c48a3470-97da-4cb4-a190-760e7e62879f
14/08/19 12:02:18 INFO spark.HttpServer: Starting HTTP Server
14/08/19 12:02:18 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/08/19 12:02:18 INFO server.AbstractConnector: Started [email protected]:53368
14/08/19 12:02:18 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/08/19 12:02:18 INFO server.AbstractConnector: Started [email protected]:4040
14/08/19 12:02:18 INFO ui.SparkUI: Started SparkUI at http://192.168.1.96:4040
2014-08-19 12:02:18.854 java[58791:1903] Unable to load realm info from SCDynamicStore
14/08/19 12:02:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/19 12:02:19 INFO spark.SparkContext: Added JAR file:/Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2/ispark-core-assembly-0.1.0-SNAPSHOT.jar at http://192.168.1.96:53368/jars/ispark-core-assembly-0.1.0-SNAPSHOT.jar with timestamp 1408442539288
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@716: Client environment:host.name=kszucs-mpb.local
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@723: Client environment:os.name=Darwin
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@724: Client environment:os.arch=13.3.0
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@725: Client environment:os.version=Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@733: Client environment:user.name=krisz
2014-08-19 12:02:19,344:58791(0x113502000):ZOO_INFO@log_env@741: Client environment:user.home=/Users/krisz
I0819 12:02:19.345005 323481600 sched.cpp:126] Version: 0.19.1
2014-08-19 12:02:19,345:58791(0x113502000):ZOO_INFO@log_env@753: Client environment:user.dir=/Users/krisz/workspace/bdas/spark-1.0.2-bin-hadoop2
2014-08-19 12:02:19,345:58791(0x113502000):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=bdas-master-1:2181,bdas-master-2:2181,bdas-master-3:2181 sessionTimeout=10000 watcher=0x113c635a6 sessionId=0 sessionPasswd=<null> context=0x7fc82c694da0 flags=0
2014-08-19 12:02:19,913:58791(0x10ddb3000):ZOO_INFO@check_events@1703: initiated connection to server [10.0.10.208:2181]
2014-08-19 12:02:20,061:58791(0x10ddb3000):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.10.208:2181], sessionId=0x247ed99f7510007, negotiated timeout=10000
I0819 12:02:20.062433 328376320 group.cpp:310] Group process ((3)@192.168.1.96:53390) connected to ZooKeeper
I0819 12:02:20.062500 328376320 group.cpp:784] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0819 12:02:20.062536 328376320 group.cpp:382] Trying to create path '/mesos' in ZooKeeper
2014-08-19 12:02:20.290 [NotebookApp] Polling kernel...
I0819 12:02:20.357967 324554752 detector.cpp:135] Detected a new leader: (id='69')
I0819 12:02:20.358140 320978944 group.cpp:655] Trying to get '/mesos/info_0000000069' in ZooKeeper
I0819 12:02:20.499778 324018176 detector.cpp:377] A new leading master (UPID=master@10.0.10.124:5050) is detected
I0819 12:02:20.499939 326164480 sched.cpp:222] New master detected at master@10.0.10.124:5050
I0819 12:02:20.500052 326164480 sched.cpp:230] No credentials provided. Attempting to register without authentication
I0819 12:02:20.905535 326164480 sched.cpp:397] Framework registered with 20140819-092950-2081030154-5050-1728-0009
14/08/19 12:02:20 INFO mesos.MesosSchedulerBackend: Registered as framework ID 20140819-092950-2081030154-5050-1728-0009
14/08/19 12:02:21 WARN ispark.SparkInterpreter$$anon$1: Warning: compiler accessed before init set up.  Assuming no postInit code.
2014-08-19 12:02:23.290 [NotebookApp] Polling kernel...
Welcome to Scala 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
2014-08-19 12:02:26.289 [NotebookApp] Polling kernel...
2014-08-19 12:02:29.290 [NotebookApp] Polling kernel...
2014-08-19 12:02:32.290 [NotebookApp] Polling kernel...
2014-08-19 12:02:35.290 [NotebookApp] Polling kernel...

Any ideas?
Thanks in advance!

Unrecognized option error

Hi,

I'm getting an urecognized option error.

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Error: Unrecognized option '--profile'.

It seems the options aren't being passed to the Main class, instead they are being passed to the Spark submit as arguments.

Have you noticed this before? I'm sure I'm doing something completely boneheaded here.

My profile file looks like :

import os
c = get_config()
SPARK_HOME = os.environ['SPARK_HOME']

MASTER = 'yarn'

c.KernelManager.kernel_cmd = [SPARK_HOME+"/bin/spark-submit",
"--master", MASTER,
"--executor-memory", "2G",
"--class", "org.tribbloid.ispark.Main",
"--jars", "/ISpark/core/target/ispark-core-0.2.0-SNAPSHOT",
"--profile", "{connection_file}",
"--interp", "Spark",
"--parent"]

c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888

Can not Completer in spark1.4.0

Hi,All,
I compile ISpark with spark1.4.0, more feature work fine, but Completer is not working, when use short key: tab, and it will has error below, and then can not execute other code any more.

Exception in thread "RequestsEventLoop" java.lang.RuntimeException: JSON deserialization error: JsResultException(errors:List((/line,List(ValidationError(error.path.missing,WrappedArray()))), (/text,List(ValidationError(error.path.missing,WrappedArray())))))
    at scala.sys.package$.error(package.scala:27)
    at scala.Predef$.error(Predef.scala:142)
    at org.tribbloid.ispark.Communication.recv(Communication.scala:77)
    at org.tribbloid.ispark.Main$EventLoop.run(Main.scala:165)

Scala syntax highlights in the ipython 3.1.0-cbccb68

We found that scala syntax highlight works in ipython 3.0.0, but doesn't work in ipython 3.1.0-cbccb68
I think it caused by this commit in ipython ipython/ipython#6793

Task not serializable error with HashingTF

Hello,

Using org.apache.spark.mllib.feature.HashingTF in an RDD map gives Task Not Serializable error. Code:

import org.apache.spark.mllib.feature.HashingTF

val spamText = sc.textFile(dataLocation)
val tf = new HashingTF(numFeatures = 10000)
val spamFeatures = spamText.map(email => tf.transform(email.split(" ")))

org.apache.spark.SparkException: Task not serializable
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
org.apache.spark.SparkContext.clean(SparkContext.scala:1605)
org.apache.spark.rdd.RDD.map(RDD.scala:286)

Printing variable value in notebook when I set value to a variable

Printing variable value in notebook is very annoying when I set large data to a variable.
Actually freezes happen often.

For example:
val test = [LARGE DATA]

Support running on yarn cluster

Hi,

are there plans to support yarn cluster?

problem with private[spark] functions?

I'm running into a problem while executing the standard spark/graphX example in ISpark, see this notebook.

Using Spark 1.1.1 with "local[2]" master and IPython Notebook 2.3, I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: ClassNotFound with classloader: org.apache.spark.executor.ExecutorURLClassLoader@b6c3ef9
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
akka.actor.ActorCell.invoke(ActorCell.scala:456)
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
akka.dispatch.Mailbox.run(Mailbox.scala:219)
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

A similar error has already been brought up by @benjaminlaird also using ISpark, see his much simpler code.

I suppose this problem has to do with the ExecutorURLClassLoader class being private[spark] (see ExecutorURLClassLoader.scala)

Of course, all the code runs fine on the standard spark-shell. The same issue happens on the spark backend for IScala from @hvanhovell

Can't compile against 2.11 & 1.4.0

I get this error:

info] Compiling 20 Scala sources to /home/aelberg/isparkbuild/ISpark/core/target/scala/classes...
[error] /home/aelberg/isparkbuild/ISpark/core/src/main/scala/org/apache/spark/repl/SparkILoopExt.scala:43: overloaded method constructor SparkILoop with alternatives:
errorscala.tools.nsc.interpreter.SparkILoop
[error](in0: java.io.BufferedReader,out: scala.tools.nsc.interpreter.JPrintWriter)scala.tools.nsc.interpreter.SparkILoop
[error](in0: Option[java.io.BufferedReader],out: scala.tools.nsc.interpreter.JPrintWriter)scala.tools.nsc.interpreter.SparkILoop
[error] cannot be applied to (Option[java.io.BufferedReader], tools.nsc.interpreter.JPrintWriter, Option[String])
[error] ) extends SparkILoop(in0, out, master) {
[error] ^

Yes, I do seem to need 2.11 and 1.4.0. The reason is that the spark installation on my cluster is 1.4 and 2.11. Trying to build against any other combination (e.g., 1.3.1 and 2.10, 1.4.0 and 2.10) and the resulting jar produces this error when used:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/repl/SparkILoop
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)

Any suggestions? Help appreciated.

Kernel restart issue

I'm running brach IPython-3.x. When I restart the kernel in Jupyterhub, I frequently get a failure. Am I doing something wrong or is this an issue? I get this message:

"The kernel has died, and the automatic restart has failed. It is possible the kernel cannot be restarted. If you are not able to restart the kernel, you will still be able to save the notebook, but running code will no longer work until the notebook is reopened."

Here is the error in the log:

16/06/15 12:52:05 INFO Client: Application report for application_1465935646593_0014 (state: ACCEPTED)
16/06/15 12:52:06 INFO Client: Application report for application_1465935646593_0014 (state: ACCEPTED)
[W 2016-06-15 12:52:07.726 jim kernelmanager:130] Timeout waiting for kernel_info_reply: 1495b014-a6bd-4b2d-8923-7c2af0cff7fe
[E 2016-06-15 12:52:07.726 jim handlers:90] Exception restarting kernel
Traceback (most recent call last):
File "/opt/Anaconda3/lib/python3.5/site-packages/notebook/services/kernels/handlers.py", line 88, in post
yield gen.maybe_future(km.restart_kernel(kernel_id))
File "/opt/Anaconda3/lib/python3.5/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/opt/Anaconda3/lib/python3.5/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "", line 3, in raise_exc_info
tornado.gen.TimeoutError: Timeout waiting for restart