
mobius's Introduction

Mobius development is deprecated. It has been superseded by a more recent offering from Microsoft, '.NET for Apache Spark' (Website | GitHub), which runs on Azure HDInsight Spark, Amazon EMR Spark, and Azure & AWS Databricks.


Mobius: C# API for Spark

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in languages supported by the .NET framework, such as C# or F#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");  
var words = lines.FlatMap(s => s.Split(' '));
var wordCounts = words.Map(w => new Tuple<string, int>(w.Trim(), 1))  
                      .ReduceByKey((x, y) => x + y);  
var wordCountCollection = wordCounts.Collect();  
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");  

A simple DataFrame application using TempTable may look like the following:

var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame  
var joinDataFrame = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
         ", MAX(joinedtable.latency) maxlatency" +
         ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
       "SELECT a.C1 as datacenter, b.C6 as latency " +  
       "FROM requests a JOIN metrics b ON a.C0  = b.C3) joinedtable " +   
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();

A simple DataFrame application using DataFrame DSL may look like the following:

// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")  
                             .Select("C0", "C1");    
// C3 - guid, C6 - latency   
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();

A simple Spark Streaming application that processes messages from Kafka using C# may be implemented using the following code:

StreamingContext sparkStreamingContext = StreamingContext.GetOrCreate(checkpointPath, () =>
    {
      var ssc = new StreamingContext(sparkContext, slideDurationInMillis);
      ssc.Checkpoint(checkpointPath);
      var stream = KafkaUtils.CreateDirectStream(ssc, topicList, kafkaParams, perTopicPartitionKafkaOffsets);
      //message format: [timestamp],[loglevel],[logmessage]
      var countByLogLevelAndTime = stream
                                    .Map(kvp => Encoding.UTF8.GetString(kvp.Value))
                                    .Filter(line => line.Contains(","))
                                    .Map(line => line.Split(','))
                                    .Map(columns => new Tuple<string, int>(
                                                          string.Format("{0},{1}", columns[0], columns[1]), 1))
                                    .ReduceByKeyAndWindow((x, y) => x + y, (x, y) => x - y,
                                                          windowDurationInSecs, slideDurationInSecs, 3)
                                    .Map(logLevelCountPair => string.Format("{0},{1}",
                                                          logLevelCountPair.Key, logLevelCountPair.Value));
      countByLogLevelAndTime.ForeachRDD(countByLogLevel =>
      {
          foreach (var logCount in countByLogLevel.Collect())
              Console.WriteLine(logCount);
      });
      return ssc;
    });
sparkStreamingContext.Start();
sparkStreamingContext.AwaitTermination();

For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

API Documentation

Refer to Mobius C# API documentation for the list of Spark's data processing operations supported in Mobius.

API Usage

Mobius API usage samples are available at:

  • Examples folder which contains standalone C# and F# projects that can be used as templates to start developing Mobius applications

  • Samples project which uses a comprehensive set of Mobius APIs to implement samples that are also used for functional validation of APIs

  • Mobius performance test scenarios implemented in C# and Scala for side by side comparison of Spark driver code

Documents

Refer to the docs folder for a design overview and other information on Mobius.

Build Status

Build status is tracked for Ubuntu 14.04.3 LTS and Windows, with unit test coverage reported on codecov.io.

Getting Started

  • Build & run unit tests: Build in Windows | Build in Linux
  • Run samples (functional tests) in local mode: Samples in Windows | Samples in Linux
  • Run examples in local mode: Examples in Windows | Examples in Linux
  • Run Mobius app: Windows | Linux
  • Run Mobius Shell: Windows | Not supported yet on Linux

Useful Links

Supported Spark Versions

Mobius is built and tested with Apache Spark 1.4.1, 1.5.2, 1.6.* and 2.0.

Releases

Mobius releases are available at https://github.com/Microsoft/Mobius/releases. References needed to build a C# Spark driver application using Mobius are also available on NuGet.


Refer to mobius-release-info.md for the details on versioning policy and the contents of the release.

License

Mobius is licensed under the MIT license. See LICENSE file for full license information.

Community

Join the chat at https://gitter.im/Microsoft/Mobius

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

mobius's People

Contributors

adamante, cosminboaca, cyruszhang, danielli90, davidk, dwnichols, forki, guiling, hebinhuang, isaacabraham, jayjaywg, jthelin, kai-zeng, kspk, lqm678, microsoft-github-policy-service[bot], ms-guizha, myasuka, qintao1976, rapoth, rdavisau, shakdoesgithub, shakibbzillow, skaarthik, slyons, sutyag, tanviraumi, tawan0109, xiongrenyi, xsidurd


mobius's Issues

DataFrame API review - new in 1.6.0

FSharp API?

Would this project be open to supporting an FSharp API wrapper?

Something to make it easier to write idiomatic FSharp code.

Interactive (REPL) support in SparkCLR (for integration with higher-level tools - Jupyter)

SparkCLR needs to support an interactive mode that enables the end-user to express computation/analysis as a set of C# code statements that are executed in a REPL (Read, Evaluate, Print, Loop) fashion. An end-user could then interact with SparkCLR via a prompt. Interactive mode also provides a stepping stone for integration with higher-level tools, particularly Jupyter.

The design considerations are summarized below.
High-level architecture/design for Jupyter/SparkCLR integration:

• SparkCLR's launch mechanism (sparkclr-submit.cmd) gets a flag (-i, interactive mode). In interactive mode, SparkCLR (actually CSharpRunner) spawns a RESTful server that exposes the following REST-based APIs.

  1. Start Session – String Start(SparkConf conf)
  2. End Session – void End(string sessionId)
  3. Execute – ResultSet Execute(String sessionId, String code)

• Each SparkCLR session has an instance of IREPLEngine, the default implementation being scriptcs. When a session is started, an IREPLEngine instance is created that supports the Execute call (mirroring the REST-based Execute API in terms of signature). A predefined set of statements is then executed to create a SparkContext instance, reflecting the standard way of creating a SparkContext in SparkCLR. An IREPLEngine instance has an exclusive SparkContext handle; there can be multiple active sessions, each having its own handle to an IREPLEngine.

• An invocation of Execute(string sessionId, string code) receives a C# statement that is passed on to the IREPLEngine instance associated with the session (identified by sessionId) for execution. A statement may produce a result that needs to be returned to the caller. The result is encapsulated in a ResultSet object, which provides an iterator to traverse the individual items. A rough sketch of these contracts is shown below.
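
To make the proposed contracts concrete, here is a minimal sketch in C#. Only the three session APIs, IREPLEngine, and ResultSet come from the description above; the interface and member shapes, namespaces, and the session-id type are illustrative assumptions, not existing SparkCLR code.

using System.Collections.Generic;
using Microsoft.Spark.CSharp.Core;   // for SparkConf

// Hypothetical REST-facing session service (Start/End/Execute as listed above).
public interface ISparkCLRSessionService
{
    string Start(SparkConf conf);                      // Start Session: returns a session id
    void End(string sessionId);                        // End Session
    ResultSet Execute(string sessionId, string code);  // Execute a C# statement in the session
}

// One instance per session; the default implementation would wrap scriptcs.
public interface IREPLEngine
{
    ResultSet Execute(string code);                    // mirrors the REST-based Execute API
}

// Encapsulates the result of a statement and provides an iterator over the individual items.
public class ResultSet
{
    private readonly List<object> items;
    public ResultSet(IEnumerable<object> items) { this.items = new List<object>(items); }
    public IEnumerator<object> GetEnumerator() { return items.GetEnumerator(); }
}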

Integrating with Jupyter

Jupyter provides the notion of a "kernel" that is language-specific and provides bindings for a given language. A C# kernel is required that allows C# code snippets to be collected from Jupyter's interface and executed in the backend. ICSharp can be a starting point if the kernel is not written from scratch. The C# kernel has a dependency on SparkCLR and invokes the SparkCLR.cmd launch script in interactive mode. Note that the SparkConfiguration (which determines the cluster the user needs to connect to) is received from Jupyter, passed on to the C# kernel, and eventually given to SparkCLR as part of the arguments to the start script (sparkclr-submit.cmd).

• An end-user of Jupyter starts in an auto-created session (the session is created by the C# kernel using the REST-based interface). By virtue of being in a session, the user has access to a SparkContext instance that is used for executing all statements entered as part of the session.
• A statement entered in Jupyter is passed verbatim to the C# kernel, which forwards it using the Execute method call. The kernel handles the ResultSet object and presents the result (or exception) to Jupyter as per its protocol.

To summarize, the tasks can be bucketed into:

SparkCLR:
a) Building the RESTful server that provides the 3 APIs
b) Modifying the SparkCLR start script to accept interactive mode and launch the server
c) Building the C# kernel (possibly on top of ICSharp, source code here)

(b) can be reused to bridge with Zeppelin in the future, if required.

get user confirmation before setting execution policy in RunSamples.cmd

Set-ExecutionPolicy is used in RunSamples.cmd, and it changes the user's preference for the Windows PowerShell execution policy. A warning should be displayed, and the execution policy should be updated only after the user confirms. If possible, the current execution policy should be checked first so that no change is made when none is needed.

Send out email notification after Travis CI build completes

Currently no email notification is sent out after a Travis CI build completes (at least not for a passed build). It would be very helpful if Travis CI sent an email notification to the PR owner after each build, just like AppVeyor does.

Improve documentation comments on APIs

Some documentation comments on APIs were copied from Scala or Python, and some of them still contain Scala/Python-specific comment tags (e.g., C{***} in Scala). The documentation comments on these APIs need to be revisited.
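
For illustration, a hypothetical before/after of such a cleanup (the comment text is made up; only the tag style matters):

// Before: documentation copied from the Scala/Python docs, with a foreign markup tag left in.
/// <summary>
/// Return a new RDD containing the distinct elements in C{this} RDD.
/// </summary>

// After: the same comment rewritten with plain C# XML documentation conventions.
/// <summary>
/// Returns a new RDD containing the distinct elements in this RDD.
/// </summary>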

Continuous integration for both Windows and Linux systems

In order to provide stable Linux compatibility for SparkCLR, we need to set up continuous integration for Linux. AppVeyor does not have Linux support. I have experimented with Travis CI, which worked well for SparkCLR on Linux. Travis CI uses xbuild for building and supports NUnit for testing. It also supports OS X, but it does not support Windows yet.

One option is to host our own Jenkins service, which supports both Windows and Linux. Another option is to use two CI tools at the same time.

The tasks of adding support for Linux can be divided into:

  1. Having a Linux version of the Build/Test/Clean scripts, including all Build.cmd's, Clean.cmd, RunSamples.cmd and Test.cmd.
  2. Setting up CI configuration

Unable to run samples in standalone mode.

I am currently trying to run the samples for SparkCLR. The local samples from within the localmode folder work great. However, when I try to execute the samples against my Spark server (cluster) using the sparkclr-submit.cmd script:

C:\MyData\Apache_Spark\SparkCLR-master\build\runtime>sparkclr-submit.cmd --verbose --master spark://spark01:7077 --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples spark.local.dir %SPARKCLR_HOME%\Temp sparkclr.sampledata.loc %SPARKCLR_HOME%\data

I am getting the following error:
The system cannot find the path specified.
SPARKCLR_JAR=spark-clr_2.10-1.6.0-SNAPSHOT.jar
Error: Could not find or load main class org.apache.spark.launcher.SparkCLRSubmitArguments

Is this another environment variable that needs to be set? I have the SPARKCLR_HOME and JAVA_HOME variables set. Are there more that are needed?

I am on the latest SparkCLR, downloaded a day ago.

Thanks all.

set executable name as app name

"org.apache.spark.deploy.csharp.CSharpRunner" shows as the app name in Spark UI. All SparkCLR apps will have the same name if app.name is not explicitly set. We could set exe name as the app name if it is provided explicitly.

@tawan0109 - I vaguely remember fixing this issue a while back. Did anything change recently?
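
In the meantime, a driver can avoid the generic name by setting the app name itself; a minimal sketch, assuming the usual Mobius driver setup where a SparkConf is passed to the SparkContext (the name string is a placeholder):

var conf = new SparkConf().SetAppName("MyMobiusApp");  // explicit app name shown in the Spark UI
var sparkContext = new SparkContext(conf);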

Build.cmd failed when Nuget.config is modified by other application

I hit a build.cmd error when building the csharp folder. Digging into it, I found that "nuget restore" didn't pull the packages into the correct folder; instead, it pulled them into a shared folder. This is because I have a pacman main empty enlistment on the same computer, which changed the nuget.config at C:\Users\rogeryu\AppData\Roaming\NuGet\NuGet.Config.

This makes nuget pull packages into c:\src\Cxcache.

This could be fixed by forcing the packages folder in build.cmd:

nuget restore -PackagesDirectory packages

Build problem with VS2015 on VSO

The latest PR #38 seems to have broken my VSO build when using VSO's VS2015 (MsBuild 14) settings.

Build works ok when using VSO's VS2013 (MsBuild 12) settings, so I think it is a similar problem to issue #8.

Paradoxically, my AppVeyor build also uses msbuild 14 and that works ok, so it seems to be some problem specific to the VSO environment :(
https://ci.appveyor.com/project/jthelin/sparkclr/build/1.4.1-SNAPSHOT.50

Sorry, I don't have VS2015 installed on any of my dev machines to suggest a PR fix for this issue.

csharp\Adapter\Microsoft.Spark.CSharp\Streaming\DStream.cs(427,38): Error CS0029: Cannot implicitly convert type 'T' to 'System.Collections.Generic.KeyValuePair<int, T>'

csharp\Adapter\Microsoft.Spark.CSharp\Streaming\DStream.cs(427,38): Error CS1662: Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type

can't run the samples

Error: Could not find or load main class org.apache.spark.launcher.SparkCLRSubmitArguments

em... can someone tell me why?

Switch to fluent assertion

Currently, Assert() from Microsoft.VisualStudio.QualityTools.UnitTestFramework is used in the samples and unit test code. FluentAssertions (or NFluent) offers a fluent assertion style that helps pinpoint the cause of an assertion failure much more efficiently. See the original comment by @jthelin suggested in #68. Creating this issue as a suggestion to adopt fluent assertions as we write/update assertions in the samples and unit test code.
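
For illustration, a minimal before/after in C#; the test values are made up, and FluentAssertions is assumed to be referenced via NuGet:

using FluentAssertions;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class WordCountTests
{
    [TestMethod]
    public void CountIsComputed()
    {
        int actualCount = 3;  // placeholder value for illustration

        // MSTest-style assertion: a failure reports only the expected and actual values.
        Assert.AreEqual(3, actualCount);

        // Fluent-style assertion: the failure message also explains why the value was expected.
        actualCount.Should().Be(3, because: "the sample input contains three words");
    }
}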

Issue Running PI example on Linux Standalone Cluster

I am able to run the samples locally, but when pointing to my Linux Cluster, I am unable to run the PI sample, as follows:

** Client Mode **
scripts\sparkclr-submit.cmd --proxy-user miadmin --total-executor-cores 2 --master spark://spark01:7077 --exe Pi.exe C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug spark.local.dir %temp%

Getting the following error(s):
"C:\MyData\Apache_Spark\SparkCLR-master\build\tools\spark-1.6.0-bin-hadoop2.6\conf\spark-env.cmd"
SPARKCLR_JAR=spark-clr_2.10-1.6.0-SNAPSHOT.jar
Zip driver directory C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug to C:\Users\shunley\AppData\Local\Temp\Debug_1453925538545.zip
[sparkclr-submit.cmd] Command to run --proxy-user miadmin --total-executor-cores 2 --master spark://spark01:7077 --name Pi --files C:\Users\shunley\AppData\Local\Temp\Debug_1453925538545.zip --class org.apache.spark.deploy.csharp.CSharpRunner C:\MyData\Apache_Spark\SparkCLR-master\build\runtime\lib\spark-clr_2.10-1.6.0-SNAPSHOT.jar C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug\Pi.exe spark.local.dir C:\Users\shunley\AppData\Local\Temp
[CSharpRunner.main] Starting CSharpBackend!
[CSharpRunner.main] Port number used by CSharpBackend is 4485
[CSharpRunner.main] adding key=spark.jars and value=file:/C:/MyData/Apache_Spark/SparkCLR-master/build/runtime/lib/spark-clr_2.10-1.6.0-SNAPSHOT.jar to environment
[CSharpRunner.main] adding key=spark.app.name and value=Pi to environment
[CSharpRunner.main] adding key=spark.cores.max and value=2 to environment
[CSharpRunner.main] adding key=spark.files and value=file:/C:/Users/shunley/AppData/Local/Temp/Debug_1453925538545.zip to environment
[CSharpRunner.main] adding key=spark.submit.deployMode and value=client to environment
[CSharpRunner.main] adding key=spark.master and value=spark://spark01:7077 to environment
[2016-01-27T20:12:19.7218665Z] [SHUNLEY10] [Info] [ConfigurationService] ConfigurationService runMode is CLUSTER
[2016-01-27T20:12:19.7228674Z] [SHUNLEY10] [Info] [SparkCLRConfiguration] CSharpBackend successfully read from environment variable CSHARPBACKEND_PORT
[2016-01-27T20:12:19.7228674Z] [SHUNLEY10] [Info] [SparkCLRIpcProxy] CSharpBackend port number to be used in JvMBridge is 4485
[2016-01-27 15:12:19,866] [1] [DEBUG] [Microsoft.Spark.CSharp.Examples.PiExample] - spark.local.dir is set to C:\Users\shunley\AppData\Local\Temp
[2016-01-27 15:12:21,467] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - ----- Running Pi example -----
collectAndServe on object of type NullObject failed
null
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.csharp.CSharpBackendHandler.handleMethodCall(CSharpBackendHandler.scala:153)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:94)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:27)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, spark02): java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
... 25 more
Caused by: java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more
()
methods:
public static int org.apache.spark.api.python.PythonRDD.collectAndServe(org.apache.spark.rdd.RDD)
args:
argType: org.apache.spark.api.csharp.CSharpRDD, argValue: CSharpRDD[1] at RDD at PythonRDD.scala:43
[2016-01-27T20:12:28.0995397Z] [SHUNLEY10] [Error] [JvmBridge] JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=12], )
[2016-01-27T20:12:28.0995397Z] [SHUNLEY10] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, spark02): java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.csharp.CSharpBackendHandler.handleMethodCall(CSharpBackendHandler.scala:153)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:94)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:27)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more

[2016-01-27T20:12:28.1296129Z] [SHUNLEY10] [Exception] [JvmBridge] JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=12], )
at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] parameters)
[2016-01-27 15:12:28,130] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - ----- Error running Pi example (duration=00:00:06.6599877) -----
System.Exception: JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=12], )
at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] parameters)
at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallStaticJavaMethod(String className, String methodName, Object[] parameters)
at Microsoft.Spark.CSharp.Proxy.Ipc.RDDIpcProxy.CollectAndServe()
at Microsoft.Spark.CSharp.Core.RDD1.Collect() at Microsoft.Spark.CSharp.Core.RDD1.Reduce(Func`3 f)
at Microsoft.Spark.CSharp.Examples.PiExample.Pi() in C:\MyData\Apache_Spark\SparkCLR-master\examples\Pi\Program.cs:line 76
at Microsoft.Spark.CSharp.Examples.PiExample.Main(String[] args) in C:\MyData\Apache_Spark\SparkCLR-master\examples\Pi\Program.cs:line 35
[2016-01-27 15:12:28,131] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - Completed running examples. Calling SparkContext.Stop() to tear down ...
[2016-01-27 15:12:28,131] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - If this program (SparkCLRExamples.exe) does not terminate in 10 seconds, please manually terminate java process launched by this program!!!
Requesting to close all call back sockets.
[CSharpRunner.main] closing CSharpBackend
Requesting to close all call back sockets.
[CSharpRunner.main] Return CSharpBackend code 1
Utils.exit() with status: 1, maxDelayMillis: 1000

I have a couple of questions, as the documentation and the quickstart here (https://github.com/Microsoft/SparkCLR/wiki/Quick-Start) don't really cover this. I'm new to Spark, so I apologize in advance if these questions are "newbie-ish".

When the Quickstart says to use the following command for a standalone cluster environment:
cd \path\to\runtime
scripts\sparkclr-submit.cmd ^
--total-executor-cores 2 ^
--master spark://host:port ^
--exe Pi.exe ^
\path\to\Pi\bin[debug|release] ^
spark.local.dir %temp%

I understand navigating to the runtime folder (locally or on the submitting server) on the first line, and I get specifying the master so it knows which Spark cluster to run on (the remote Spark cluster). What is confusing here is: are we still pointing to the local (Windows) file system for the Pi executable and the temp directory?

We're currently looking to use Spark and SparkR to do our processing from our application, and I am just trying to understand how your API interacts with Spark: submitting work, retrieving results, etc.

Any help getting the Cluster Samples up and running (Client and Cluster mode) would be greatly appreciated.

Thanks,

Scott

Update Pom.xml to support creating uber package as a separate goal

Pom.xml is used to produce a lean-and-mean SparkCLR package by default. This default Pom.xml supports debug mode in IDEs like IntelliJ. For end-to-end testing and job submission, as a workaround, build.cmd modifies pom.xml at build time to produce an uber SparkCLR package which includes dependent packages (PR#127).

The workaround should be replaced by an explicit, separate goal in Pom.xml.

Problem building SparkCLR code with msbuild VS2015 settings on VSO

Any clues as to why this code won't build with MSBuild 2015 on VSO?

I set up an automated build from github master branch on my VSO instance with default settings for the msbuild step, and strangely the build failed with the errors below.

I knew the code built ok for me on my local machine (which has VS2013 installed), so as an experiment I set the VSO build step to use "VS2013" setting instead, and strangely the VSO build succeeded!

I then switched the VSO build step back to the "VS2015" setting and got the original failure again :(

I haven't got VS2015 on any of my machines yet, but I would guess that an interactive build might show the same problem?

Any thoughts?

2015-10-31T17:54:51.2733808Z ##[error]csharp\Adapter\Microsoft.Spark.CSharp\Core\RDD.cs(258,109): Error CS0029: Cannot implicitly convert type 'T' to 'System.Collections.Generic.KeyValuePair<T, int>'
2015-10-31T17:54:51.2743806Z      2>Core\RDD.cs(258,109): error CS0029: Cannot implicitly convert type 'T' to 'System.Collections.Generic.KeyValuePair<T, int>' [C:\a\1\s\csharp\Adapter\Microsoft.Spark.CSharp\Adapter.csproj]
2015-10-31T17:54:51.2743806Z ##[error]csharp\Adapter\Microsoft.Spark.CSharp\Core\RDD.cs(258,109): Error CS1662: Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type
2015-10-31T17:54:51.2753809Z      2>Core\RDD.cs(258,109): error CS1662: Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type [C:\a\1\s\csharp\Adapter\Microsoft.Spark.CSharp\Adapter.csproj]

Initialize Broadcast variables properly when SparkCLR restores from checkpoint

When checkpointing is enabled and broadcast variables are also referenced in user code, SparkCLR fails to restore from the checkpoint, and the exception below is thrown. The root cause is that broadcast variables are not properly initialized before being referenced after restart.
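
For context, a minimal sketch of the pattern that triggers this; the variable names and payload are illustrative, and it assumes the usual Mobius streaming setup (a StreamingContext created via GetOrCreate with checkpointing, and a pair DStream named pairDStream):

// Hedged repro sketch: a broadcast variable captured by streaming user code.
// On a fresh start this works; after the app restarts from the checkpoint,
// reading broadcastLevels.Value throws "Attempted to use broadcast id 0 after it was destroyed".
var broadcastLevels = sparkContext.Broadcast(new[] { "ERROR", "WARN" });  // illustrative payload

var stateDStream = pairDStream.UpdateStateByKey<string, int, int>(
    (values, state) =>
    {
        // Referencing the broadcast value inside the update function is what fails after restore.
        var levels = broadcastLevels.Value;
        return state + values.Count();
    });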

[2016-02-03 11:06:57,751] [1] [INFO ] [Microsoft.Spark.CSharp.Worker] - rddInfo: rddId 35, stageId 13, partitionId 1
[2016-02-03 11:06:57,905] [1] [ERROR] [Microsoft.Spark.CSharp.Worker] - System.ArgumentException: Attempted to use broadcast id 0 after it was destroyed.
   at Microsoft.Spark.CSharp.Core.Broadcast`1.get_Value() in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Adapter\Microsoft.Spark.CSharp\Core\Broadcast.cs:line 97
   at Microsoft.Spark.CSharp.DStreamSamples.UpdateStateByKeyHelper.Execute(IEnumerable`1 vs, Int32 s) in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\DStreamSamples.cs:line 63
   at Microsoft.Spark.CSharp.Streaming.UpdateStateByKeyHelper`3.b__0(KeyValuePair`2 x) in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Adapter\Microsoft.Spark.CSharp\Streaming\PairDStreamFunctions.cs:line 616
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at System.Linq.Enumerable.d__94`1.MoveNext()
   at System.Linq.Enumerable.d__94`1.MoveNext()
   at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
   at System.Linq.Enumerable.d__94`1.MoveNext()
   at Microsoft.Spark.CSharp.Worker.Main(String[] args) in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Worker\Microsoft.Spark.CSharp\Worker.cs:line 165

Inconsistent addressing schemes used by sockets in SparkCLR

In Microsoft.Spark.CSharp.Core.RDD, the Collect method creates a socket without explicitly specifying the AddressFamily, which by default is InterNetworkV6 (IPv6). This also causes "invalid argument" exceptions on Linux when calling Collect on an RDD. It is also inconsistent with other places in SparkCLR, e.g., SparkCLRSocket or JvmBridge, where sockets always explicitly use AddressFamily.InterNetwork (IPv4).
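
For illustration, here is how a socket can be created with the address family specified explicitly, using the standard .NET socket API (the endpoint values are placeholders):

using System.Net;
using System.Net.Sockets;

class CollectSocketExample
{
    static Socket CreateIpv4Socket(int port)
    {
        // Explicitly request IPv4 (AddressFamily.InterNetwork) instead of relying on the default,
        // matching what SparkCLRSocket and JvmBridge already do.
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        socket.Connect(new IPEndPoint(IPAddress.Loopback, port));  // port is a placeholder
        return socket;
    }
}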
