angel-ml / pytorch-on-angel
PyTorch On Angel, arming PyTorch with a powerful Parameter Server, which enables PyTorch to train very big models.
This error did not occur with version 0.2; I am not sure which change introduced it.
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
at com.tencent.angel.pytorch.graph.gcn.GCN.makeGraph(GCN.scala:60)
at com.tencent.angel.pytorch.graph.gcn.GNN.initialize(GNN.scala:99)
at com.tencent.angel.pytorch.examples.supervised.cluster.GraphSageExample$.main(GraphSageExample.scala:150)
at com.tencent.angel.pytorch.examples.supervised.cluster.GraphSageExample.main(GraphSageExample.scala)
Caused by: com.tencent.angel.exception.AngelException: com.tencent.angel.exception.AngelException: node id is not in range [0, 10
at com.tencent.angel.psagent.matrix.MatrixClientImpl.get(MatrixClientImpl.java:732)
at com.tencent.angel.spark.models.impl.PSVectorImpl.psfGet(PSVectorImpl.scala:78)
at com.tencent.angel.pytorch.graph.gcn.GNNPSModel.readLabels2(GNNPSModel.scala:71)
at com.tencent.angel.pytorch.graph.gcn.GraphAdjPartition.splitTrainTest(GraphPartition.scala:62)
at com.tencent.angel.pytorch.graph.gcn.GraphAdjPartition.toSemiGCNPartition(GraphPartition.scala:49)
at com.tencent.angel.pytorch.graph.gcn.GCN$$anonfun$4.apply(GCN.scala:54)
at com.tencent.angel.pytorch.graph.gcn.GCN$$anonfun$4.apply(GCN.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
Add some parameters to CMakeLists to support GAT and HGAT, such as requiring torch version >= 1.4.0 and linking the torch_scatter library when the GAT or HGAT algorithm is used.
Integrate one-hop and two-hop for DGI algorithm
support DSSM algorithm on pytorch on angel
1. Platform: the AT-platform virtual machines all work, but another cloud does not!!!
In my tests, the setup on that other platform failed with different errors; its LAN probably has some special settings, or there is a hostname problem.
2. Build mode: local build with a pseudo-distributed configuration, on CentOS 7.2.
3. gcc: version 7.3 is enough. cmake 3.21 reported a warning while configuring libtorch and I am not sure whether it matters; I later switched to 3.12 and got through.
Reference pages:
Upgrading gcc on CentOS: https://blog.csdn.net/ncdx111/article/details/106047228
Downloading and installing cmake: https://blog.csdn.net/weixin_30781433/article/details/98787965?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.base&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.base
1. hadoop: any 2.7.x will do; 2.7.1 and 2.7.5 are verified to work.
2. spark: someone in the group tested this before: spark 2.3.0 is required here, and 2.4.0 reports errors.
3. pytorch: pytorch 1.3.1 with torchvision 0.4.2. As I understand it, pytorch is only used to generate the model; I am not sure whether it is still needed at runtime.
libtorch: use libtorch 1.3.1. The PyTorch site offers both pre-cxx11 ABI and cxx11 ABI builds of the same version; both compile, but one of them reports a symbol error at runtime. As I recall the pre-cxx11 ABI build is the usable one, though I do not remember exactly.
Reference pages:
Hadoop setup: https://blog.csdn.net/csdnmrliu/article/details/82963783 (no source build needed: download, extract, and set the environment variables)
GitHub download acceleration: https://blog.csdn.net/haejwcalcv/article/details/108028245
Spark setup: download the tarball from https://archive.apache.org/dist/, extract it, and configure the env script under the conf folder.
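Since these notes pin fairly specific versions (gcc 7.3, cmake 3.12, spark 2.3.0, pytorch/libtorch 1.3.1), it can save a failed build to sanity-check the local toolchain first. A minimal sketch; the thresholds are the ones from these notes, so adjust them if your combination differs:

```shell
#!/bin/sh
# true if version $1 >= version $2, using GNU sort's version ordering
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

gcc_ver=$(gcc -dumpversion 2>/dev/null)
if ver_ge "$gcc_ver" 7.3; then
  echo "gcc $gcc_ver: ok (notes used 7.3)"
else
  echo "gcc $gcc_ver: older than 7.3, consider upgrading"
fi

# the notes hit warnings with cmake 3.21 and succeeded with 3.12,
# so just print what is installed and compare by eye
cmake --version 2>/dev/null | head -n1
```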
Someone posted the configuration earlier; it can be used directly.
These few lines are not much of a problem; just make sure Java is not misconfigured.
A wrong ANGEL package path, a wrong scala path, or a missing jar will all report an error and exit 0; concretely, line 80 of example.scala (the input-reading part) fails. I was stuck here for a long time.
There were some other miscellaneous problems in between that I do not want to scroll back through the chat history for; they should all be solvable with a web search.
https://blog.csdn.net/qq_50665031/article/details/108987205 is a page about installing glibc 2.23; I forget what it was needed for, but if anyone else runs into it, take a look.
Done.
Here is the output log of mvn clean package -Dmaven.test.skip=true:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 26.028 s
[INFO] Finished at: 2022-05-05T16:24:57Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pytorch-on-angel: Could not resolve dependencies for project com.tencent.angel:pytorch-on-angel:jar:0.3.0: Could not find artifact edu.princeton.cs:algs4:jar:1.0.3 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
This error also happened even when using the Docker method recommended in README.md.
How can I solve this problem? Any help would be appreciated.
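One workaround worth trying for this class of error (a sketch, not an official fix): the build only needs the algs4 artifact to exist in your local Maven repository, so you can download the jar yourself (it is distributed from the algs4 book site) and install it manually with Maven's install-file goal, using the coordinates from the error message:

```shell
# Assumption: algs4.jar was downloaded manually beforehand
mvn install:install-file \
  -DgroupId=edu.princeton.cs -DartifactId=algs4 \
  -Dversion=1.0.3 -Dpackaging=jar \
  -Dfile=algs4.jar
```

After that, re-run mvn clean package -Dmaven.test.skip=true.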
When running the demo, I hit:
Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/ml/matrix/RowType
at com.tencent.angel.pytorch.model.TorchParams.&lt;init&gt;(TorchParams.scala:29)
at com.tencent.angel.pytorch.model.ParTorchModel.init(ParTorchModel.scala:69)
at com.tencent.angel.pytorch.examples.ClusterExample$.main(ClusterExample.scala:68)
at com.tencent.angel.pytorch.examples.ClusterExample.main(ClusterExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.ml.matrix.RowType
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
This class's path was changed in commit 44b7c476d0c6b2aab7399f5200f64c44aa973fb5; it was moved into angel-math-0.1.0.jar. How should I handle this?
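One thing worth trying: if the class now lives in angel-math-0.1.0.jar, ship that jar explicitly to both the executors and the parameter servers. A sketch of the relevant submit-script fragment; the jar path is a placeholder for wherever your Angel distribution's lib directory puts it:

```shell
# Hypothetical fragment: append angel-math to both jar lists
spark-submit \
  --jars $SONA_SPARK_JARS,lib/angel-math-0.1.0.jar \
  --conf spark.ps.jars=$SONA_ANGEL_JARS,lib/angel-math-0.1.0.jar \
  ...
```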
In actual runs I see one spark process and two angel ps processes started, and one of the angel ps processes exits before the spark process does. The run log then reports: send stop command to Master failed.
A prompt answer would be much appreciated!
2019-09-05 15:13:06 ERROR AngelClient:480 - send stop command to Master failed
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /13.190.232.43:21029
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:317)
at com.sun.proxy.$Proxy25.stop(Unknown Source)
at com.tencent.angel.client.AngelClient.stop(AngelClient.java:477)
at com.tencent.angel.client.AngelPSClient.stopPS(AngelPSClient.java:181)
at com.tencent.angel.spark.context.AngelPSContext$.doStop(AngelPSContext.scala:441)
at com.tencent.angel.spark.context.AngelPSContext$$anon$2.run(AngelPSContext.scala:323)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /10.110.132.43:21029
at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:121)
at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:297)
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:294)
... 6 more
Caused by: java.io.IOException: Error connecting to /10.110.132.43:21029
at com.tencent.angel.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:149)
at com.tencent.angel.ipc.NettyTransceiver.transceive(NettyTransceiver.java:338)
at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:292)
... 7 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.110.132.43:21029
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
... 11 more
2019-09-05 15:13:06 INFO YarnClientImpl:395 - Killed application application_1567663579362_0002
End of LogType:stdout
Could you provide a tutorial on which versions of angel and the other dependencies the current version of pytorch-on-angel should be paired with? I have tried repeatedly for a long time and still run into all kinds of version mismatches.
add edge feature for gnn
support HGAT algorithm on pytorch on angel, which is a novel heterogeneous graph neural network.
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_3 locally
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_0 locally
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_1 locally
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_2 locally
terminate called after throwing an instance of 'c10::Error'
what(): forward_() is missing value for argument 'second_edge_index'. Declaration: forward_(ClassType self, Tensor pos_x, Tensor neg_x, Tensor first_edge_index, Tensor second_edge_index) -> ((Tensor, Tensor, Tensor)) (checkAndNormalizeInputs at /pytorch/aten/src/ATen/core/function_schema_inl.h:270)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f4953ac9813 in libc10.so)
frame #1: + 0x323bdea (0x7f4956f29dea in libtorch.so)
frame #2: torch::jit::Function::operator()(std::vector&lt;c10::IValue, std::allocator&lt;c10::IValue&gt; &gt;, std::unordered_map&lt;std::string, c10::IValue, std::hash&lt;std::string&gt;, std::equal_to&lt;std::string&gt;, std::allocator&lt;std::pair&lt;std::string const, c10::IValue&gt; &gt; &gt; const&amp;) + 0x36 (0x7f4956f280a6 in libtorch.so)
frame #3: torch::jit::script::Method::operator()(std::vector&lt;c10::IValue, std::allocator&lt;c10::IValue&gt; &gt;, std::unordered_map&lt;std::string, c10::IValue, std::hash&lt;std::string&gt;, std::equal_to&lt;std::string&gt;, std::allocator&lt;std::pair&lt;std::string const, c10::IValue&gt; &gt; &gt; const&amp;) + 0xc9 (0x7f4956ee6709 in libtorch.so)
frame #4: angel::TorchModel::forward(std::vector&lt;c10::IValue, std::allocator&lt;c10::IValue&gt; &gt;) + 0xcc (0x7f4960c3c744 in /data/data3/yarn/nm2/usercache/service/filecache/39/libtorch_angel.so)
frame #5: angel::TorchModel::backward(std::vector&lt;c10::IValue, std::allocator&lt;c10::IValue&gt; &gt;, at::Tensor) + 0x67 (0x7f4960c3cb77 in /data/data3/yarn/nm2/usercache/service/filecache/39/libtorch_angel.so)
frame #6: Java_com_tencent_angel_pytorch_Torch_gcnBackward + 0x33c (0x7f4960c31d70 in /data/data3/yarn/nm2/usercache/service/filecache/39/libtorch_angel.so)
frame #7: [0x7f49c49bd6c7]
I am honored to have been selected for the Angel project and to begin the hands-on open-source phase. Being able to study the architecture and design principles of the Angel distributed machine learning platform together with the mentors and fellow students is a rare opportunity. Below are my practice notes from this open-source activity. Given my limited ability, errors and omissions are inevitable; corrections from expert readers are welcome.
This project is a paper reproduction based on Angel-ML/PyTorch-On-Angel; before doing anything else, we need to set up a runnable environment.
PyTorch on Angel's architecture
PyTorch-On-Angel consists of three main modules.
Sorting out the dependencies:
All of the following is based on Ubuntu 20.04 LTS. Since this is my personal machine, the environment is not completely clean, and I cannot guarantee there are no other issues.
The first step, naturally, is:
git clone https://github.com/Angel-ML/PyTorch-On-Angel.git --depth 1
The project documentation describes how to build. For convenience, I prepared mirror source files and put them under ./addon for later use:
Debian 9 sources.list
:
deb http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
maven settings.xml
:
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
<mirrors>
<mirror>
<id>nexus-tencentyun</id>
<mirrorOf>*</mirrorOf>
<name>Nexus tencentyun</name>
<url>http://mirrors.cloud.tencent.com/nexus/repository/maven-public/</url>
</mirror>
</mirrors>
</settings>
Modified the Dockerfile:
########################################################################################################################
# DEV #
########################################################################################################################
FROM maven:3.6.1-jdk-8 as DEV
##########################
# install dependencies #
##########################
COPY ./addon/sources.list /etc/apt/sources.list
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
curl=7.52.1-5+deb9u9 \
g++=4:6.3.0-4 \
make=4.1-9.1 \
unzip=6.0-21+deb9u1 \
python3 \
python3-pip \
python3-setuptools \
python3-wheel \
&& rm -rf /var/lib/apt/lists/*
#####################
# Install PyTorch #
#####################
RUN python3 -m pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple \
https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl
#######################
# install new cmake #
#######################
RUN curl -fsSL --insecure -o /tmp/cmake.tar.gz https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
&& tar -xzf /tmp/cmake.tar.gz -C /tmp \
&& rm -rf /tmp/cmake.tar.gz \
&& mv /tmp/cmake-* /tmp/cmake \
&& cd /tmp/cmake \
&& ./bootstrap \
&& make -j8 \
&& make install \
&& rm -rf /tmp/cmake
#######################
# download libtorch #
#######################
WORKDIR /opt
RUN curl -fsSL --insecure -o libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-wit \
&& unzip -q libtorch.zip \
&& rm libtorch.zip
ENV TORCH_HOME=/opt/libtorch
########################################################################################################################
# JAVA BUILDER #
########################################################################################################################
FROM DEV as JAVA_BUILDER
COPY ./addon/settings.xml /usr/share/maven/conf/
WORKDIR /app
COPY ./java/pom.xml /app
RUN mvn -e -B dependency:resolve dependency:resolve-plugins
COPY ./java /app
RUN mvn -e -B -Dmaven.test.skip=true package
########################################################################################################################
# CPP BUILDER #
########################################################################################################################
FROM DEV as CPP_BUILDER
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
zip=3.0-11+b1 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY ./cpp ./
RUN ./build.sh \
&& cp ./out/*.so "$TORCH_HOME"/lib \
&& cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib \
&& ln -s "$TORCH_HOME"/lib torch-lib \
&& zip -qr /torch.zip torch-lib
########################################################################################################################
# Artifacts #
########################################################################################################################
FROM alpine:3.10 as ARTIFACTS
WORKDIR /dist
COPY --from=CPP_BUILDER /torch.zip ./
COPY --from=JAVA_BUILDER /app/target/*.jar ./
VOLUME /output
CMD [ "/bin/sh", "-c", "cp ./* /output" ]
Modify cpp/CMakeLists.txt:
set(TORCH_HOME $ENV{TORCH_HOME})
Run build.sh and wait a moment:
./build.sh
If downloading and installing is slow, you can also prepare the needed files under addon in advance and modify the corresponding parts of the Dockerfile:
cd addon && wget https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl
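For example, the libtorch download step in the Dockerfile could then be replaced by a COPY from addon. A sketch; the file name must match the archive you downloaded above:

```dockerfile
# Use the pre-downloaded archive instead of fetching it at build time
COPY ./addon/libtorch-cxx11-abi-shared-with-deps-1.3.1+cpu.zip /opt/libtorch.zip
RUN cd /opt \
    && unzip -q libtorch.zip \
    && rm libtorch.zip
```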
Modify gen_pt_model.sh, changing python to python3:
docker run -it --rm -v $(pwd)/${MODEL_PATH}:/model.py -v $(pwd)/dist:/output -w /output ${IMAGE_NAME} python3 /model.py ${@:2}
Under ./dist you will then have the files we need:
deepfm.pt pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar torch.zip
That completes step one!
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
My OCD kicked in at the sight of so many useless files, so delete them:
find . -name '*.cmd' | xargs rm
Modify the configuration files:
hadoop-env.sh
export JAVA_HOME="adjust to your setup"
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
That completes the HDFS setup; now format it:
hdfs namenode -format
Start it and check that it works (startup requires SSH access to master and the workers; SSH setup is omitted here):
./start-dfs.sh
jps
# 105141 DataNode
# 104964 NameNode
# 105385 SecondaryNameNode
# If all of these are present, it is working; otherwise check the logs
mapred-site.xml
Change the execution framework to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
yarn resource configuration: the default is 8G, which may not be enough to run Angel, so adjust it to your machine's specs:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>12</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>12</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>30720</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>30720</value>
</property>
</configuration>
Start it and check that it works:
./start-yarn.sh
jps
# 107761 ResourceManager
# 108141 NodeManager
# If all of these are present, it is working; otherwise check the logs
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
With Hadoop configured, Spark configuration is simple: Spark on YARN can read the configuration directly from Hadoop, so you only need to modify:
spark-env.sh
export HADOOP_CONF_DIR="adjust to your setup"
Start it and check that it works:
./start-all.sh
jps
# 2273766 Worker
# 2273463 Master
# If all of these are present, it is working; otherwise check the logs
Mind the jdk version, or later steps will fail:
sudo apt install openjdk-8-jdk -y
sudo apt install maven -y
Build and install protobuf 2.5.0 following its README.txt; remember to run ldconfig at the end:
wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
Build it following the instructions:
wget https://github.com/Angel-ML/angel/archive/refs/tags/Release-2.4.0.tar.gz
After the build completes, extract it and configure:
spark-on-angel-env.sh
export SPARK_HOME="adjust to your setup"
export ANGEL_HOME="adjust to your setup"
export ANGEL_HDFS_HOME="adjust to your setup"
export ANGEL_VERSION=2.4.0
# works around version issues with some jars
angel_ps_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar
sona_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar,json4s-jackson_2.11-3.2.11.jar,json4s-ast_2.11-3.2.11.jar,json4s-core_2.11-3.2.11.jar
Create a directory and put the needed files on HDFS for later use:
hdfs dfs -mkdir /angel
hdfs dfs -put ./angel/data/census/census_148d_train.libsvm /angel
hdfs dfs -put ./angel/lib /angel
Put the four files generated earlier in a suitable place:
torch.zip
pytorch-on-angel-0.2.0.jar
pytorch-on-angel-0.2.0-jar-with-dependencies.jar
deepfm.pt
spark-submit
Adjust the configuration parameters to your actual situation.
Because --archives torch.zip#torch never worked for me, and searching for an explanation turned up nothing, I extracted torch.zip and uploaded the files with --files instead:
#!/bin/bash
JAVA_LIBRARY_PATH="adjust to your setup"
source ./angel/bin/spark-on-angel-env.sh
input="adjust to your setup"
output="adjust to your setup"
torchlib=torch-lib/libpthreadpool.a,torch-lib/libcpuinfo_internals.a,torch-lib/libCaffe2_perfkernels_avx2.a,torch-lib/libgmock.a,torch-lib/libprotoc.a,torch-lib/libnnpack.a,torch-lib/libgtest.a,torch-lib/libpytorch_qnnpack.a,torch-lib/libcaffe2_detectron_ops.so,torch-lib/libCaffe2_perfkernels_avx512.a,torch-lib/libgomp-753e6e92.so.1,torch-lib/libgloo.a,torch-lib/libonnx.a,torch-lib/libtorch_angel.so,torch-lib/libbenchmark_main.a,torch-lib/libcaffe2_protos.a,torch-lib/libgtest_main.a,torch-lib/libprotobuf-lite.a,torch-lib/libasmjit.a,torch-lib/libCaffe2_perfkernels_avx.a,torch-lib/libonnx_proto.a,torch-lib/libfoxi_loader.a,torch-lib/libfbgemm.a,torch-lib/libc10.so,torch-lib/libclog.a,torch-lib/libbenchmark.a,torch-lib/libgmock_main.a,torch-lib/libnnpack_reference_layers.a,torch-lib/libcaffe2_module_test_dynamic.so,torch-lib/libqnnpack.a,torch-lib/libprotobuf.a,torch-lib/libc10d.a,torch-lib/libtorch.so,torch-lib/libcpuinfo.a,torch-lib/libstdc++.so.6,torch-lib/libmkldnn.a
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.ps.instances=1 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=5g \
--conf spark.ps.log.level=INFO \
--conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
--conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
--conf spark.executor.extraLibraryPath=. \
--conf spark.driver.extraLibraryPath=. \
--conf spark.executorEnv.OMP_NUM_THREADS=2 \
--conf spark.executorEnv.MKL_NUM_THREADS=2 \
--name "deepfm for torch on angel" \
--jars $SONA_SPARK_JARS \
--files deepfm.pt,$torchlib \
--driver-memory 5g \
--num-executors 1 \
--executor-cores 1 \
--executor-memory 5g \
--class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.2.0.jar \
trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
stepSize:0.001 numEpoch:10 testRatio:0.1 \
angelModelOutputPath:$output
Visit http://master:8088/cluster/apps and collect your success!
Hey guys, I got an error at the make stage:
-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 6.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found torch: /opt/libtorch/lib/libtorch.so
-- FOUND TORCH VERSION : 1.5.0
-- Found JNI: /usr/local/openjdk-8/jre/lib/amd64/libjawt.so
CMake Error at CMakeLists.txt:38 (add_subdirectory):
add_subdirectory given source "pytorch_scatter-2.0.5" which is not an
existing directory.
-- Configuring incomplete, errors occurred!
See also "/app/out/CMakeFiles/CMakeOutput.log".
See also "/app/out/CMakeFiles/CMakeError.log".
make: *** No targets specified and no makefile found. Stop.
The command '/bin/sh -c ./build.sh && cp ./out/*.so "$TORCH_HOME"/lib && cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib && ln -s "$TORCH_HOME"/lib torch-lib && zip -qr /torch.zip torch-lib' returned a non-zero code: 2
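The error means CMake expects a pytorch_scatter-2.0.5 source directory next to cpp/CMakeLists.txt that is not in the checkout. One workaround is to fetch that source before building; this is a sketch, and the GitHub tag URL is an assumption to verify against what your CMakeLists.txt actually references:

```shell
cd cpp
curl -fsSL -o pytorch_scatter.tar.gz \
  https://github.com/rusty1s/pytorch_scatter/archive/refs/tags/2.0.5.tar.gz
tar -xzf pytorch_scatter.tar.gz   # creates pytorch_scatter-2.0.5/
rm pytorch_scatter.tar.gz
```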
Could you provide a pre-built package directly, so users do not have to build it themselves? That would be very friendly to users. Building locally currently hits all kinds of problems and wastes a lot of time and effort, and following other people's tutorials online also fails in various ways (package versions that no longer exist, missing sources, and so on). It is extremely time-consuming, and spending so much time stuck on the very first build step is not worth it. Please take a look at whether this can be supported. Many thanks!
rename input parameters for edgeprop
fix bug about softmax
Hello, a quick question:
While testing PyTorch-On-Angel, my submission failed. In the code I found com.tencent.angel.graph.utils.params.HasBatchSize; what could be causing this?
param mode = yarn-client
Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:53)
  at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.spark.ml.graph.params.HasBatchSize
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 25 more
pytorch on angel branch 0.3.0
angel version 3.2.0
Submission parameters:
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue default \
  --conf spark.ps.instances=1 \
  --conf spark.ps.cores=1 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=4g \
  --conf spark.ps.log.level=INFO \
  --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
  --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
  --conf spark.executor.extraLibraryPath=. \
  --conf spark.driver.extraLibraryPath=. \
  --conf spark.executorEnv.OMP_NUM_THREADS=2 \
  --conf spark.executorEnv.MKL_NUM_THREADS=2 \
  --name "deepfm for torch on angel" \
  --jars $SONA_SPARK_JARS \
  --files deepfm.pt,$torchlib \
  --driver-memory 4g \
  --num-executors 1 \
  --executor-cores 1 \
  --executor-memory 4g \
  --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.3.0.jar \
  trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
  stepSize:0.001 numEpoch:10 testRatio:0.1 \
  angelModelOutputPath:$output
abstract convolution operator for graphsage
support bipartite gnn on pytorch on angel
add esmm model
Where is the spark-on-angel-env.sh mentioned in README.md?
support edgeProp algorithm on pytorch on angel, which is an end-to-end Graph Convolution Network (GCN)-based algorithm to learn the embeddings of the nodes and edges of a large-scale time-evolving graph
support multi-label-classification and class-weight for rgcn
add feature importance for deep&wide
support GAT algorithm on pytorch on angel
support high-sparse feature for gnn on pytorch on angel
Hey guys, I've encountered a problem at the make stage:
PyTorch-On-Angel/cpp/include/angel/pytorch/model.h:71:33: error: ‘as_variable_ref’ is not a member of ‘torch::autograd’
return torch::autograd::as_variable_ref(p.value);
....
make[2]: *** [CMakeFiles/torch_angel.dir/src/angel/pytorch/angel_torch.cc.o] Interrupt
make[2]: *** [CMakeFiles/torch_angel.dir/src/angel/pytorch/model.cc.o] Interrupt
make[1]: *** [CMakeFiles/torch_angel.dir/all] Interrupt
my env is:
python 3.7
jdk 1.8
maven 3.6.2
gcc 7.1.0
libtorch 1.9.0
pytorch 1.9.0
cmake 3.12.2
cuda 10.2
Does anyone know how to solve it? Thanks a lot!
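For what it's worth, torch::autograd::as_variable_ref was removed from libtorch's C++ API in later releases, so libtorch/pytorch 1.9.0 is almost certainly too new for this code base; other notes in this thread build against libtorch 1.3.1. A sketch of fetching the matching CPU build (the URL is the one used elsewhere in this document):

```shell
curl -fsSL -o libtorch.zip \
  "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip"
unzip -q libtorch.zip -d /opt
export TORCH_HOME=/opt/libtorch   # point the build at the older libtorch
```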
Modify the pom files of angel-ps and spark-on-angel to the versions above.
apache-maven-3.8.1
hadoop-2.7.2
jdk1.8.0_161
protobuf-2.5.0
scala-2.11.8
spark-2.3.0-bin-hadoop2.7
angel-2.4.0-bin
http://hadoop001:8088/cluster/apps
Add to the Dockerfile:
ENV Torch_DIR=/opt/libtorch/share/cmake/Torch
Also add torchvision=0.4.2 in the Dockerfile.
source /home/liuqian/angel/angel/dist/target/angel-2.4.0-bin/bin/spark-on-angel-env.sh
Raise yarn.scheduler.capacity.maximum-am-resource-percent to 0.6.
I'm currently attempting to train xDeepFM using the libffm format.
I noticed in the code that load_svmlight_file from sklearn is used to load the text file.
While load_svmlight_file works well for loading data in the libsvm format:
label feature1:value1 feature2:value2
with the libffm format:
label field1:feature1:value1 field2:feature2:value2
I get the error:
ValueError: could not convert string to float: b'0:0.11545'
Since load_svmlight_file ultimately converts to a sparse matrix, I can also convert my data directly to a sparse matrix. However, while it is obvious what the libsvm format would look like in matrix format, it isn't obvious what the libffm format would look like.
Has anyone successfully trained a xDeepFM using this repo? Please help!
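I have not trained xDeepFM from this repo myself, but the immediate parse error can be worked around by stripping the field prefix so that load_svmlight_file sees plain libsvm. Note that this discards the field information, which an FFM-family model normally needs, so treat it as a way to get the pipeline running rather than a faithful conversion. A sketch with awk:

```shell
# libffm:  label field:feature:value ...  ->  libsvm:  label feature:value ...
# NOTE: this drops the field index on purpose
to_libsvm() {
  awk '{
    out = $1
    for (i = 2; i <= NF; i++) {
      split($i, a, ":")          # a[1]=field, a[2]=feature, a[3]=value
      out = out " " a[2] ":" a[3]
    }
    print out
  }'
}

echo "1 0:3:0.5 1:7:0.25" | to_libsvm   # -> 1 3:0.5 7:0.25
```

Run it over a file with `to_libsvm < train.ffm > train.libsvm`; the stripped file then parses with load_svmlight_file, though a model that actually uses fields would need the field indices preserved some other way.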
21/02/04 15:13:00 ERROR ApplicationMaster: User class threw exception: java.lang.AbstractMethodError: com.tencent.angel.pytorch.graph.gcn.GCN.org$apache$spark$ml$param$Params$setter$paramMap_$eq(Lorg/apache/spark/ml/param/ParamMap;)V
java.lang.AbstractMethodError: com.tencent.angel.pytorch.graph.gcn.GCN.org$apache$spark$ml$param$Params$setter$paramMap_$eq(Lorg/apache/spark/ml/param/ParamMap;)V
at org.apache.spark.ml.param.Params$class.$init$(params.scala:868)
at com.tencent.angel.pytorch.graph.gcn.GNN.&lt;init&gt;(GNN.scala:39)
at com.tencent.angel.pytorch.graph.gcn.GNN.&lt;init&gt;(GNN.scala:46)
Since neighbors and types are initialized separately, and each initialization sorts by node key, the neighbors and types end up mismatched when angel.ps.router.type=range.
Steps to get started with the Tencent project Angel.
Follow the link for more info:
https://github.com/JeromeYHJ/start-on-Angel
support unsupervised graphsage on pytorch on angel
support IGMC algorithm on pytorch on angel, an Inductive Graph-based Matrix Completion model that addresses the setting where no side information is available other than the matrix to complete