
mobius's Introduction

Mobius development is deprecated. It has been superseded by a more recent offering from Microsoft, '.NET for Apache Spark' (Website | GitHub), which runs on Azure HDInsight Spark, Amazon EMR Spark, and Azure & AWS Databricks.


Mobius: C# API for Spark

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in languages supported by the .NET framework, such as C# or F#.

For example, the word count sample in Apache Spark can be implemented in C# as follows:

var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");  
var words = lines.FlatMap(s => s.Split(' '));
var wordCounts = words.Map(w => new Tuple<string, int>(w.Trim(), 1))  
                      .ReduceByKey((x, y) => x + y);  
var wordCountCollection = wordCounts.Collect();  
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");  

A simple DataFrame application using TempTable may look like the following:

var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame  
var joinDataFrame = sqlContext.Sql(
    "SELECT joinedtable.datacenter" +
         ", MAX(joinedtable.latency) maxlatency" +
         ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
       "SELECT a.C1 as datacenter, b.C6 as latency " +  
       "FROM requests a JOIN metrics b ON a.C0  = b.C3) joinedtable " +   
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();

A simple DataFrame application using DataFrame DSL may look like the following:

// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")  
                             .Select("C0", "C1");    
// C3 - guid, C6 - latency   
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();

A simple Spark Streaming application that processes messages from Kafka using C# may be implemented using the following code:

StreamingContext sparkStreamingContext = StreamingContext.GetOrCreate(checkpointPath, () =>
    {
      var ssc = new StreamingContext(sparkContext, slideDurationInMillis);
      ssc.Checkpoint(checkpointPath);
      var stream = KafkaUtils.CreateDirectStream(ssc, topicList, kafkaParams, perTopicPartitionKafkaOffsets);
      //message format: [timestamp],[loglevel],[logmessage]
      var countByLogLevelAndTime = stream
                                    .Map(kvp => Encoding.UTF8.GetString(kvp.Value))
                                    .Filter(line => line.Contains(","))
                                    .Map(line => line.Split(','))
                                    .Map(columns => new Tuple<string, int>(
                                                          string.Format("{0},{1}", columns[0], columns[1]), 1))
                                    .ReduceByKeyAndWindow((x, y) => x + y, (x, y) => x - y,
                                                          windowDurationInSecs, slideDurationInSecs, 3)
                                    .Map(logLevelCountPair => string.Format("{0},{1}",
                                                          logLevelCountPair.Key, logLevelCountPair.Value));
      countByLogLevelAndTime.ForeachRDD(countByLogLevel =>
      {
          foreach (var logCount in countByLogLevel.Collect())
              Console.WriteLine(logCount);
      });
      return ssc;
    });
sparkStreamingContext.Start();
sparkStreamingContext.AwaitTermination();

For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

API Documentation

Refer to Mobius C# API documentation for the list of Spark's data processing operations supported in Mobius.

API Usage

Mobius API usage samples are available at:

  • Examples folder which contains standalone C# and F# projects that can be used as templates to start developing Mobius applications

  • Samples project which uses a comprehensive set of Mobius APIs to implement samples that are also used for functional validation of APIs

  • Mobius performance test scenarios implemented in C# and Scala for side by side comparison of Spark driver code

Documents

Refer to the docs folder for a design overview and other information on Mobius.

Build Status

Build status is tracked for Ubuntu 14.04.3 LTS and Windows, with unit test coverage reported on codecov.io.

Getting Started

  • Build & run unit tests: Build in Windows | Build in Linux
  • Run samples (functional tests) in local mode: Samples in Windows | Samples in Linux
  • Run examples in local mode: Examples in Windows | Examples in Linux
  • Run Mobius app: Windows | Linux
  • Run Mobius Shell: Windows | Not supported yet on Linux

Useful Links

Supported Spark Versions

Mobius is built and tested with Apache Spark 1.4.1, 1.5.2, 1.6.* and 2.0.

Releases

Mobius releases are available at https://github.com/Microsoft/Mobius/releases. References needed to build a C# Spark driver application using Mobius are also available on NuGet.


Refer to mobius-release-info.md for the details on versioning policy and the contents of the release.

License

Mobius is licensed under the MIT license. See LICENSE file for full license information.

Community

Join the chat at https://gitter.im/Microsoft/Mobius

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

mobius's People

Contributors

adamante, cosminboaca, cyruszhang, danielli90, davidk, dwnichols, forki, guiling, hebinhuang, isaacabraham, jayjaywg, jthelin, kai-zeng, kspk, lqm678, microsoft-github-policy-service[bot], ms-guizha, myasuka, qintao1976, rapoth, rdavisau, shakdoesgithub, shakibbzillow, skaarthik, slyons, sutyag, tanviraumi, tawan0109, xiongrenyi, xsidurd


mobius's Issues

DataFrame API review - new in 1.6.0

FSharp API?

Would this project be open to supporting an FSharp API wrapper?

Something to make it easier to write idiomatic FSharp code.

Interactive (REPL) support in SparkCLR (for integration with higher-level tools - Jupyter)

SparkCLR needs to support an interactive mode that enables the end-user to express computation/analysis as a set of C# code statements that are executed in a REPL (Read, Evaluate, Print, Loop) fashion. An end-user could then interact with SparkCLR via a prompt. Interactive mode also provides a stepping stone for integration with higher-level tools, particularly Jupyter.

The design considerations are summarized below.
High-level architecture/design for Jupyter/SparkCLR integration:

• SparkCLR's launch mechanism (sparkclr-submit.cmd) gets a flag (-i, interactive mode). In interactive mode, SparkCLR (actually CSharpRunner) spawns a RESTful server that exposes the following REST-based APIs.

  1. Start Session – String Start(SparkConf conf)
  2. End Session – void End(string sessionId)
  3. Execute – ResultSet Execute(String sessionId, String code)

• Each SparkCLR session has an instance of IREPLEngine, the default implementation being scriptcs. When a session is started, an IREPLEngine instance is created that supports the Execute call (mirroring the REST-based Execute API in terms of signature). A predefined set of statements is then executed to create a SparkContext instance, reflecting the standard way of creating a SparkContext in SparkCLR. An IREPLEngine instance has an exclusive SparkContext handle; there can be multiple active sessions, each having its own handle to an IREPLEngine.

• An invocation of Execute(string sessionId, string code) receives a C# statement that is passed on to the IREPLEngine instance associated with the session (identified by sessionId) for execution. A statement may produce a result that needs to be returned to the caller. The result is encapsulated in a ResultSet object, which provides an iterator to traverse the individual items. A rough sketch of these contracts is shown below.
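
To make the proposed contracts concrete, here is a minimal sketch in C#. Only the three session APIs, IREPLEngine, and ResultSet come from the description above; the interface and member shapes, namespaces, and the session-id type are illustrative assumptions, not existing SparkCLR code.

using System.Collections.Generic;
using Microsoft.Spark.CSharp.Core;   // for SparkConf

// Hypothetical REST-facing session service (Start/End/Execute as listed above).
public interface ISparkCLRSessionService
{
    string Start(SparkConf conf);                      // Start Session: returns a session id
    void End(string sessionId);                        // End Session
    ResultSet Execute(string sessionId, string code);  // Execute a C# statement in the session
}

// One instance per session; the default implementation would wrap scriptcs.
public interface IREPLEngine
{
    ResultSet Execute(string code);                    // mirrors the REST-based Execute API
}

// Encapsulates the result of a statement and provides an iterator over the individual items.
public class ResultSet
{
    private readonly List<object> items;
    public ResultSet(IEnumerable<object> items) { this.items = new List<object>(items); }
    public IEnumerator<object> GetEnumerator() { return items.GetEnumerator(); }
}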

Integrating with Jupyter

Jupyter provides the notion of a "kernel" that is language-specific and provides bindings for a given language. A C# kernel is required that allows C# code snippets to be collected from Jupyter's interface and executed in the backend. ICSharp can be a starting point if the kernel is not written from scratch. The C# kernel has a dependency on SparkCLR and invokes the SparkCLR.cmd launch script in interactive mode. Note that the SparkConfiguration (which determines the cluster the user needs to connect to) is received from Jupyter, passed on to the C# kernel, and eventually given to SparkCLR as part of the arguments to the start script (sparkclr-submit.cmd).

• An end-user of Jupyter starts in an auto-created session (the session is created by the C# kernel using the REST-based interface). By virtue of being in a session, the user has access to a SparkContext instance that is used for executing all statements entered as part of the session.
• A statement entered in Jupyter is passed verbatim to the C# kernel, which forwards it using the Execute method call. The kernel handles the ResultSet object and presents the result (or exception) to Jupyter as per its protocol.

To summarize, the tasks can be bucketed into:

SparkCLR:
a) Building the RESTful server that provides the 3 APIs
b) Modifying the SparkCLR start script to accept interactive mode and launch the server
c) Building the C# kernel (possibly on top of ICSharp, source code here)

(b) can be reused to bridge with Zeppelin in the future, if required.

get user confirmation before setting execution policy in RunSamples.cmd

Set-ExecutionPolicy is used in RunSamples.cmd, and it changes the user's preference for the Windows PowerShell execution policy. A warning should be displayed, and the execution policy should be updated only after the user confirms. If possible, the current execution policy should be checked first so that no change is made when none is needed.

Send out email notification after Travis CI build completes

Currently no email notification is sent out after a Travis CI build completes (at least not for a passed build). It would be very helpful if Travis CI sent an email notification to the PR owner after each build, just like AppVeyor does.

Improve documentation comments on APIs

Some documentation comments on APIs were copied from Scala or Python, and some of them still contain Scala/Python-specific comment tags (e.g., C{***} in Scala). The documentation comments on these APIs need to be revisited.
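
For illustration, a hypothetical before/after of such a cleanup (the comment text is made up; only the tag style matters):

// Before: documentation copied from the Scala/Python docs, with a foreign markup tag left in.
/// <summary>
/// Return a new RDD containing the distinct elements in C{this} RDD.
/// </summary>

// After: the same comment rewritten with plain C# XML documentation conventions.
/// <summary>
/// Returns a new RDD containing the distinct elements in this RDD.
/// </summary>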

Continuous integration for both Windows and Linux systems

In order to provide stable Linux compatibility for SparkCLR, we need to set up continuous integration for Linux. AppVeyor does not have Linux support. I have experimented with Travis CI, which worked well for SparkCLR on Linux. Travis CI uses xbuild for building and supports NUnit for testing. It also supports OS X, but it does not support Windows yet.

One option is to host our own Jenkins service, which supports both Windows and Linux. Another option is to use two CI tools at the same time.

The tasks of adding support for Linux can be divided into:

  1. Having a Linux version of the Build/Test/Clean scripts, including all Build.cmd's, Clean.cmd, RunSamples.cmd and Test.cmd.
  2. Setting up CI configuration

Unable to run samples in standalone mode.

I am currently trying to run the samples for SparkCLR. The local samples from within the localmode folder work great. However, when I try to execute the samples against my Spark server (cluster) using the sparkclr-submit.cmd script:

C:\MyData\Apache_Spark\SparkCLR-master\build\runtime>sparkclr-submit.cmd --verbose --master spark://spark01:7077 --exe SparkCLRSamples.exe %SPARKCLR_HOME%\samples spark.local.dir %SPARKCLR_HOME%\Temp sparkclr.sampledata.loc %SPARKCLR_HOME%\data

I am getting the following error:
The system cannot find the path specified.
SPARKCLR_JAR=spark-clr_2.10-1.6.0-SNAPSHOT.jar
Error: Could not find or load main class org.apache.spark.launcher.SparkCLRSubmitArguments

Is this another environment variable that needs to be set? I have the SPARKCLR_HOME and JAVA_HOME variables set. Are there more that are needed?

I am on the latest SparkCLR, downloaded a day ago.

Thanks all.

set executable name as app name

"org.apache.spark.deploy.csharp.CSharpRunner" shows as the app name in Spark UI. All SparkCLR apps will have the same name if app.name is not explicitly set. We could set exe name as the app name if it is provided explicitly.

@tawan0109 - I vaguely remember fixing this issue a while back. Did anything change recently?
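
In the meantime, a driver can avoid the generic name by setting the app name itself; a minimal sketch, assuming the usual Mobius driver setup where a SparkConf is passed to the SparkContext (the name string is a placeholder):

var conf = new SparkConf().SetAppName("MyMobiusApp");  // explicit app name shown in the Spark UI
var sparkContext = new SparkContext(conf);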

Build.cmd failed when Nuget.config is modified by other application

I hit a build.cmd error when building the csharp folder. Digging into it, I found that "nuget restore" didn't pull the packages into the correct folder; instead, it pulled them into a shared folder. This is because I have a pacman main empty enlistment on the same computer, which changed the nuget.config at C:\Users\rogeryu\AppData\Roaming\NuGet\NuGet.Config.

This makes nuget pull packages into c:\src\Cxcache.

This could be fixed by forcing the packages folder in build.cmd:

nuget restore -PackagesDirectory packages

Build problem with VS2015 on VSO

The latest PR #38 seems to have broken my VSO build when using VSO's VS2015 (MsBuild 14) settings.

Build works ok when using VSO's VS2013 (MsBuild 12) settings, so I think it is a similar problem to issue #8.

Paradoxically, my AppVeyor build also uses msbuild 14 and that works ok, so it seems to be some problem specific to the VSO environment :(
https://ci.appveyor.com/project/jthelin/sparkclr/build/1.4.1-SNAPSHOT.50

Sorry, I don't have VS2015 installed on any of my dev machines to suggest a PR fix for this issue.

csharp\Adapter\Microsoft.Spark.CSharp\Streaming\DStream.cs(427,38): Error CS0029: Cannot implicitly convert type 'T' to 'System.Collections.Generic.KeyValuePair<int, T>'

csharp\Adapter\Microsoft.Spark.CSharp\Streaming\DStream.cs(427,38): Error CS1662: Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type

can't run the samples

Error: Could not find or load main class org.apache.spark.launcher.SparkCLRSubmitArguments

em... can someone tell me why?

Switch to fluent assertion

Currently, Assert() from Microsoft.VisualStudio.QualityTools.UnitTestFramework is used in the samples and unit test code. FluentAssertions (or NFluent) offers a fluent assertion style that helps pinpoint the cause of an assertion failure much more efficiently. See the original comment by @jthelin suggested in #68. Creating this issue as a suggestion to adopt fluent assertions as we write/update assertions in the samples and unit test code.
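
For illustration, a minimal before/after in C#; the test values are made up, and FluentAssertions is assumed to be referenced via NuGet:

using FluentAssertions;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class WordCountTests
{
    [TestMethod]
    public void CountIsComputed()
    {
        int actualCount = 3;  // placeholder value for illustration

        // MSTest-style assertion: a failure reports only the expected and actual values.
        Assert.AreEqual(3, actualCount);

        // Fluent-style assertion: the failure message also explains why the value was expected.
        actualCount.Should().Be(3, because: "the sample input contains three words");
    }
}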

Issue Running PI example on Linux Standalone Cluster

I am able to run the samples locally, but when pointing to my Linux Cluster, I am unable to run the PI sample, as follows:

** Client Mode **
scripts\sparkclr-submit.cmd --proxy-user miadmin --total-executor-cores 2 --master spark://spark01:7077 --exe Pi.exe C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug spark.local.dir %temp%

Getting the following error(s):
"C:\MyData\Apache_Spark\SparkCLR-master\build\tools\spark-1.6.0-bin-hadoop2.6\conf\spark-env.cmd"
SPARKCLR_JAR=spark-clr_2.10-1.6.0-SNAPSHOT.jar
Zip driver directory C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug to C:\Users\shunley\AppData\Local\Temp\Debug_1453925538545.zip
[sparkclr-submit.cmd] Command to run --proxy-user miadmin --total-executor-cores 2 --master spark://spark01:7077 --name Pi --files C:\Users\shunley\AppData\Local\Temp\Debug_1453925538545.zip --class org.apache.spark.deploy.csharp.CSharpRunner C:\MyData\Apache_Spark\SparkCLR-master\build\runtime\lib\spark-clr_2.10-1.6.0-SNAPSHOT.jar C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug C:\MyData\Apache_Spark\SparkCLR-master\examples\pi\bin\Debug\Pi.exe spark.local.dir C:\Users\shunley\AppData\Local\Temp
[CSharpRunner.main] Starting CSharpBackend!
[CSharpRunner.main] Port number used by CSharpBackend is 4485
[CSharpRunner.main] adding key=spark.jars and value=file:/C:/MyData/Apache_Spark/SparkCLR-master/build/runtime/lib/spark-clr_2.10-1.6.0-SNAPSHOT.jar to environment
[CSharpRunner.main] adding key=spark.app.name and value=Pi to environment
[CSharpRunner.main] adding key=spark.cores.max and value=2 to environment
[CSharpRunner.main] adding key=spark.files and value=file:/C:/Users/shunley/AppData/Local/Temp/Debug_1453925538545.zip to environment
[CSharpRunner.main] adding key=spark.submit.deployMode and value=client to environment
[CSharpRunner.main] adding key=spark.master and value=spark://spark01:7077 to environment
[2016-01-27T20:12:19.7218665Z] [SHUNLEY10] [Info] [ConfigurationService] ConfigurationService runMode is CLUSTER
[2016-01-27T20:12:19.7228674Z] [SHUNLEY10] [Info] [SparkCLRConfiguration] CSharpBackend successfully read from environment variable CSHARPBACKEND_PORT
[2016-01-27T20:12:19.7228674Z] [SHUNLEY10] [Info] [SparkCLRIpcProxy] CSharpBackend port number to be used in JvMBridge is 4485
[2016-01-27 15:12:19,866] [1] [DEBUG] [Microsoft.Spark.CSharp.Examples.PiExample] - spark.local.dir is set to C:\Users\shunley\AppData\Local\Temp
[2016-01-27 15:12:21,467] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - ----- Running Pi example -----
collectAndServe on object of type NullObject failed
null
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.csharp.CSharpBackendHandler.handleMethodCall(CSharpBackendHandler.scala:153)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:94)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:27)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, spark02): java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
... 25 more
Caused by: java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more
()
methods:
public static int org.apache.spark.api.python.PythonRDD.collectAndServe(org.apache.spark.rdd.RDD)
args:
argType: org.apache.spark.api.csharp.CSharpRDD, argValue: CSharpRDD[1] at RDD at PythonRDD.scala:43
[2016-01-27T20:12:28.0995397Z] [SHUNLEY10] [Error] [JvmBridge] JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=12], )
[2016-01-27T20:12:28.0995397Z] [SHUNLEY10] [Error] [JvmBridge] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 9, spark02): java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.csharp.CSharpBackendHandler.handleMethodCall(CSharpBackendHandler.scala:153)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:94)
at org.apache.spark.api.csharp.CSharpBackendHandler.channelRead0(CSharpBackendHandler.scala:27)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot run program "CSharpWorker.exe": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:161)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.api.csharp.CSharpRDD.compute(CSharpRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:187)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 15 more

[2016-01-27T20:12:28.1296129Z] [SHUNLEY10] [Exception] [JvmBridge] JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=12], )
at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] parameters)
[2016-01-27 15:12:28,130] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - ----- Error running Pi example (duration=00:00:06.6599877) -----
System.Exception: JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=12], )
at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] parameters)
at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallStaticJavaMethod(String className, String methodName, Object[] parameters)
at Microsoft.Spark.CSharp.Proxy.Ipc.RDDIpcProxy.CollectAndServe()
at Microsoft.Spark.CSharp.Core.RDD1.Collect() at Microsoft.Spark.CSharp.Core.RDD1.Reduce(Func`3 f)
at Microsoft.Spark.CSharp.Examples.PiExample.Pi() in C:\MyData\Apache_Spark\SparkCLR-master\examples\Pi\Program.cs:line 76
at Microsoft.Spark.CSharp.Examples.PiExample.Main(String[] args) in C:\MyData\Apache_Spark\SparkCLR-master\examples\Pi\Program.cs:line 35
[2016-01-27 15:12:28,131] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - Completed running examples. Calling SparkContext.Stop() to tear down ...
[2016-01-27 15:12:28,131] [1] [INFO ] [Microsoft.Spark.CSharp.Examples.PiExample] - If this program (SparkCLRExamples.exe) does not terminate in 10 seconds, please manually terminate java process launched by this program!!!
Requesting to close all call back sockets.
[CSharpRunner.main] closing CSharpBackend
Requesting to close all call back sockets.
[CSharpRunner.main] Return CSharpBackend code 1
Utils.exit() with status: 1, maxDelayMillis: 1000

I have a couple of questions, as the documentation and the quickstart here (https://github.com/Microsoft/SparkCLR/wiki/Quick-Start) don't really cover this. I'm new to Spark, so I apologize in advance if these questions are "newbie-ish".

When the Quickstart says to use the following command for a standalone cluster environment:
cd \path\to\runtime
scripts\sparkclr-submit.cmd ^
--total-executor-cores 2 ^
--master spark://host:port ^
--exe Pi.exe ^
\path\to\Pi\bin[debug|release] ^
spark.local.dir %temp%

I understand navigating to the runtime folder (locally or on the submitting server) on the first line, and I get specifying the master so it knows which Spark cluster to run on (the remote Spark cluster). What is confusing here is: are we still pointing to the local (Windows) file system for the Pi executable and the temp directory?

We're currently looking to use Spark and SparkR to do our processing from our application, and I am just trying to understand how your API interacts with Spark: submitting work, retrieving results, etc.

Any help getting the Cluster Samples up and running (Client and Cluster mode) would be greatly appreciated.

Thanks,

Scott

Update Pom.xml to support creating uber package as a separate goal

Pom.xml is used to produce a lean-and-mean SparkCLR package by default. This default Pom.xml supports debug mode in IDEs like IntelliJ. For end-to-end testing and job submission, as a workaround, build.cmd modifies pom.xml at build time to produce an uber SparkCLR package which includes dependent packages (PR#127).

The workaround should be replaced by an explicit, separate goal in Pom.xml.

Problem building SparkCLR code with msbuild VS2015 settings on VSO

Any clues as to why this code won't build with MSBuild 2015 on VSO?

I set up an automated build from github master branch on my VSO instance with default settings for the msbuild step, and strangely the build failed with the errors below.

I knew the code built ok for me on my local machine (which has VS2013 installed), so as an experiment I set the VSO build step to use "VS2013" setting instead, and strangely the VSO build succeeded!

I then switched the VSO build step back to the "VS2015" setting and got the original failure again :(

I haven't got VS2015 on any of my machines yet, but I would guess that an interactive build might show the same problem?

Any thoughts?

2015-10-31T17:54:51.2733808Z ##[error]csharp\Adapter\Microsoft.Spark.CSharp\Core\RDD.cs(258,109): Error CS0029: Cannot implicitly convert type 'T' to 'System.Collections.Generic.KeyValuePair<T, int>'
2015-10-31T17:54:51.2743806Z      2>Core\RDD.cs(258,109): error CS0029: Cannot implicitly convert type 'T' to 'System.Collections.Generic.KeyValuePair<T, int>' [C:\a\1\s\csharp\Adapter\Microsoft.Spark.CSharp\Adapter.csproj]
2015-10-31T17:54:51.2743806Z ##[error]csharp\Adapter\Microsoft.Spark.CSharp\Core\RDD.cs(258,109): Error CS1662: Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type
2015-10-31T17:54:51.2753809Z      2>Core\RDD.cs(258,109): error CS1662: Cannot convert lambda expression to intended delegate type because some of the return types in the block are not implicitly convertible to the delegate return type [C:\a\1\s\csharp\Adapter\Microsoft.Spark.CSharp\Adapter.csproj]

Initialize Broadcast variables properly when SparkCLR restores from checkpoint

When checkpointing is enabled and broadcast variables are also referenced in user code, SparkCLR fails to restore from the checkpoint, and the exception below is thrown. The root cause is that broadcast variables are not properly initialized before being referenced after restart.
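
For context, a minimal sketch of the pattern that triggers this; the variable names and payload are illustrative, and it assumes the usual Mobius streaming setup (a StreamingContext created via GetOrCreate with checkpointing, and a pair DStream named pairDStream):

// Hedged repro sketch: a broadcast variable captured by streaming user code.
// On a fresh start this works; after the app restarts from the checkpoint,
// reading broadcastLevels.Value throws "Attempted to use broadcast id 0 after it was destroyed".
var broadcastLevels = sparkContext.Broadcast(new[] { "ERROR", "WARN" });  // illustrative payload

var stateDStream = pairDStream.UpdateStateByKey<string, int, int>(
    (values, state) =>
    {
        // Referencing the broadcast value inside the update function is what fails after restore.
        var levels = broadcastLevels.Value;
        return state + values.Count();
    });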

[2016-02-03 11:06:57,751] [1] [INFO ] [Microsoft.Spark.CSharp.Worker] - rddInfo: rddId 35, stageId 13, partitionId 1
[2016-02-03 11:06:57,905] [1] [ERROR] [Microsoft.Spark.CSharp.Worker] - System.ArgumentException: Attempted to use broadcast id 0 after it was destroyed.
   at Microsoft.Spark.CSharp.Core.Broadcast`1.get_Value() in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Adapter\Microsoft.Spark.CSharp\Core\Broadcast.cs:line 97
   at Microsoft.Spark.CSharp.DStreamSamples.UpdateStateByKeyHelper.Execute(IEnumerable`1 vs, Int32 s) in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Samples\Microsoft.Spark.CSharp\DStreamSamples.cs:line 63
   at Microsoft.Spark.CSharp.Streaming.UpdateStateByKeyHelper`3.b__0(KeyValuePair`2 x) in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Adapter\Microsoft.Spark.CSharp\Streaming\PairDStreamFunctions.cs:line 616
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at System.Linq.Enumerable.d__94`1.MoveNext()
   at System.Linq.Enumerable.d__94`1.MoveNext()
   at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
   at System.Linq.Enumerable.d__94`1.MoveNext()
   at Microsoft.Spark.CSharp.Worker.Main(String[] args) in d:\SparkWorkspace\Microsoft\SparkCLR\csharp\Worker\Microsoft.Spark.CSharp\Worker.cs:line 165

Inconsistent addressing schemes used by sockets in SparkCLR

In Microsoft.Spark.CSharp.Core.RDD, the Collect method creates a socket without explicitly specifying the AddressFamily, which by default is InterNetworkV6 (IPv6). This also causes "invalid argument" exceptions on Linux when calling Collect on an RDD. It is also inconsistent with other places in SparkCLR, e.g., SparkCLRSocket or JvmBridge, where sockets always explicitly use AddressFamily.InterNetwork (IPv4).
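
For illustration, here is how a socket can be created with the address family specified explicitly, using the standard .NET socket API (the endpoint values are placeholders):

using System.Net;
using System.Net.Sockets;

class CollectSocketExample
{
    static Socket CreateIpv4Socket(int port)
    {
        // Explicitly request IPv4 (AddressFamily.InterNetwork) instead of relying on the default,
        // matching what SparkCLRSocket and JvmBridge already do.
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        socket.Connect(new IPEndPoint(IPAddress.Loopback, port));  // port is a placeholder
        return socket;
    }
}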
