Comments (11)

rjurney commented on August 15, 2024

@markucsb I think I resolved this in a596fc6 by following your instructions. Would you mind testing things out?

from agile_data_code_2.

pdawczak commented on August 15, 2024

Thanks @rjurney for your response. It's a shame; I was counting on being able to set up a local environment... I commute roughly 3 hours to work by train, and being able to do the code exercises on my machine would let me progress with the book.

I appreciate that managing Vagrant in this instance is difficult for you, but would you be able to point me to some resources (e.g. which versions of the tools to use, how to manage Vagrant) so I could investigate myself?

Thanks!

pdawczak commented on August 15, 2024

Similar here: I was able to execute simple queries against my Elasticsearch instance directly, but PySpark fails to do so.

  1. I cd'd to the ch02 dir and ran pyspark with this command:
PYSPARK_DRIVER_PYTHON=ipython pyspark --jars ../lib/mongo-hadoop-spark-2.0.2.jar,../lib/mongo-java-driver-3.4.2.jar,../lib/mongo-hadoop-2.0.2.jar --driver-class-path ../lib/mongo-hadoop-spark-2.0.2.jar:../lib/mongo-java-driver-3.4.2.jar:../lib/mongo-hadoop-2.0.2.jar
  2. In there, I ran:
>>> exec(open("pyspark_elasticsearch.py").read())
  3. which resulted in this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 10, in <module>
  File "/home/vagrant/spark/python/pyspark/rdd.py", line 1421, in saveAsNewAPIHadoopFile
    keyConverter, valueConverter, jconf)
  File "/home/vagrant/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/vagrant/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/vagrant/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
        at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1$$anonfun$apply$14.apply(PythonRDD.scala:738)
        at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1$$anonfun$apply$14.apply(PythonRDD.scala:737)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1.apply(PythonRDD.scala:737)
        at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1.apply(PythonRDD.scala:736)
        at scala.Option.flatMap(Option.scala:171)
        at org.apache.spark.api.python.PythonRDD$.getKeyValueTypes(PythonRDD.scala:736)
        at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:828)
        at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)

I'm afraid the error message doesn't tell me much...
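For what it's worth, java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable usually means the Elasticsearch-Hadoop classes were never put on Spark's classpath; the --jars list in the command above contains only the MongoDB jars. A sketch of a launch command that also passes the Elasticsearch jar (the jar name and version are taken from a later comment in this thread and may differ in your checkout):

```shell
# The Elasticsearch classes live in the elasticsearch-spark jar, which the
# original --jars list omits (jar name/version assumed from a later comment).
LIB=../lib
JARS=$LIB/mongo-hadoop-spark-2.0.2.jar,$LIB/mongo-java-driver-3.4.2.jar,$LIB/mongo-hadoop-2.0.2.jar,$LIB/elasticsearch-spark-20_2.10-5.5.1.jar
# --driver-class-path wants the same list, colon-separated
CLASSPATH=$(echo "$JARS" | tr ',' ':')
echo PYSPARK_DRIVER_PYTHON=ipython pyspark --jars "$JARS" --driver-class-path "$CLASSPATH"
```

Building both lists from one variable avoids the easy mistake of updating --jars but not --driver-class-path.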

markucsb commented on August 15, 2024

One thing I noticed is that the Vagrant provisioning has errors in several places, including the download and installation of both Kafka and Hadoop: it references versions that are no longer available.
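A common cause of this: provisioning scripts pin download URLs on a live Apache mirror, and live mirrors delete old releases. archive.apache.org retains every release, so pointing the script there is one way to make pinned versions stable. A sketch of such a URL (the version numbers below are assumptions for illustration, not the ones the script actually pins):

```shell
# Live Apache mirrors drop old releases; archive.apache.org keeps them all.
# Version numbers here are illustrative only.
KAFKA_VERSION=0.10.2.0
SCALA_VERSION=2.11
KAFKA_URL="https://archive.apache.org/dist/kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz"
echo "$KAFKA_URL"
```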

pdawczak commented on August 15, 2024

Thanks @markucsb for your comment. Do you have any suggestions, or are you aware of how that could be fixed? I've stopped reading the book, as I'm unable to reproduce those simple tests...

markucsb commented on August 15, 2024

Did not mean to close the issue; I do not have any suggestions at this point. I've now done the EC2 instance setup and am hitting another issue. I am also blocked by this issue and cannot move forward in the book, which is very frustrating.

>>> schema_data.saveAsNewAPIHadoopFile(
...     path='-',
...     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
...     keyClass="org.apache.hadoop.io.NullWritable",
...     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
...     conf={ "es.resource" : "agile_data_science/executives" })
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/home/ubuntu/spark/python/pyspark/rdd.py", line 1421, in saveAsNewAPIHadoopFile
    keyConverter, valueConverter, jconf)
  File "/home/ubuntu/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/ubuntu/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/ubuntu/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.NoClassDefFoundError: org/apache/commons/httpclient/protocol/ProtocolSocketFactory
        at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransportFactory.create(CommonsHttpTransportFactory.java:39)
        at org.elasticsearch.hadoop.rest.NetworkClient.selectNextNode(NetworkClient.java:99)
        at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:82)
        at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:59)
        at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:92)
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:284)
        at org.elasticsearch.hadoop.mr.EsOutputFormat.init(EsOutputFormat.java:260)
        at org.elasticsearch.hadoop.mr.EsOutputFormat.checkOutputSpecs(EsOutputFormat.java:233)
        at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:76)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:1003)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:994)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:994)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:994)
        at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:839)
        at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.httpclient.protocol.ProtocolSocketFactory
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 36 more

markucsb commented on August 15, 2024

I was able to get it to work on the EC2 instance by installing http://central.maven.org/maven2/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar within ~/Agile_Data_Code_2/lib/
and adding it at the end of this line:
spark.jars /home/ubuntu/Agile_Data_Code_2/lib/mongo-hadoop-spark-2.0.2.jar,/home/ubuntu/Agile_Data_Code_2/lib/mongo-java-driver-3.4.2.jar,/home/ubuntu/Agile_Data_Code_2/lib/mongo-hadoop-2.0.2.jar,/home/ubuntu/Agile_Data_Code_2/lib/elasticsearch-spark-20_2.10-5.5.1.jar,/home/ubuntu/Agile_Data_Code_2/lib/snappy-java-1.1.2.6.jar,/home/ubuntu/Agile_Data_Code_2/lib/lzo-hadoop-1.0.5.jar

in /home/vagrant/spark/conf/spark-defaults.conf.
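To script that same workaround, something like the following appends the jar to the spark.jars line. This is a sketch demonstrated on a throwaway copy of the file; on a real machine CONF would be the spark-defaults.conf path given above, and the jar itself comes from the Maven Central URL quoted in this comment:

```shell
# Demonstrate appending the commons-httpclient jar to the spark.jars line.
# A throwaway temp file stands in for spark-defaults.conf.
CONF=$(mktemp)
printf 'spark.jars /home/ubuntu/Agile_Data_Code_2/lib/mongo-hadoop-spark-2.0.2.jar\n' > "$CONF"
JAR=/home/ubuntu/Agile_Data_Code_2/lib/commons-httpclient-3.1.jar
# Append ",<jar>" to the line that starts with "spark.jars" (& = matched line)
sed -i "s#^spark.jars .*#&,$JAR#" "$CONF"
cat "$CONF"
```

Note that spark.jars takes a comma-separated list, so the new jar must be joined to the existing line with a comma, not added on a new line.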

rjurney commented on August 15, 2024

@markucsb @pdawczak Sorry for the problems. The Vagrant image is deprecated; I can't manage to support two environments, and trying to do so was a mistake. Use the EC2 scripts, and I will fix any issues you find there. The EC2 instance I use costs less than the book if you stop it when you aren't using it. Please bear with me!

markucsb commented on August 15, 2024

@rjurney looks like the EC2 bootstrap has the same issue with the Kafka version; it is not installed properly. Can you update that before I test the other changes? Airflow is also not installed.

rjurney commented on August 15, 2024

Thanks for letting me know. I'll take a look at it tonight.

rjurney commented on August 15, 2024

I hate to do this, but Vagrant has been deprecated. I can't support two systems with my limited time. Use EC2. :(
