angel-ml / pytorch-on-angel Goto Github PK

View Code? Open in Web Editor NEW

164.0 164.0 51.0 78.52 MB

PyTorch On Angel, arming PyTorch with a powerful Parameter Server, which enable PyTorch to train very big models.

CMake 0.11% C++ 8.20% Java 3.00% Scala 69.44% Python 17.95% C 0.90% Dockerfile 0.34% Shell 0.05%

pytorch-on-angel's People

Contributors

Stargazers

Watchers

pytorch-on-angel's Issues

使用新版本跑示例程序出错

之前用0.2版本没有这个错误，不知道是因为更新了什么导致的

at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) at org.apache.spark.rdd.RDD.count(RDD.scala:1168) at com.tencent.angel.pytorch.graph.gcn.GCN.makeGraph(GCN.scala:60) at com.tencent.angel.pytorch.graph.gcn.GNN.initialize(GNN.scala:99) at com.tencent.angel.pytorch.examples.supervised.cluster.GraphSageExample$.main(GraphSageExample.scala:150) at com.tencent.angel.pytorch.examples.supervised.cluster.GraphSageExample.main(GraphSageExample.scala) Caused by: com.tencent.angel.exception.AngelException: com.tencent.angel.exception.AngelException: node id is not in range [0, 10 at com.tencent.angel.psagent.matrix.MatrixClientImpl.get(MatrixClientImpl.java:732) at com.tencent.angel.spark.models.impl.PSVectorImpl.psfGet(PSVectorImpl.scala:78) at com.tencent.angel.pytorch.graph.gcn.GNNPSModel.readLabels2(GNNPSModel.scala:71) at com.tencent.angel.pytorch.graph.gcn.GraphAdjPartition.splitTrainTest(GraphPartition.scala:62) at com.tencent.angel.pytorch.graph.gcn.GraphAdjPartition.toSemiGCNPartition(GraphPartition.scala:49) at com.tencent.angel.pytorch.graph.gcn.GCN$$anonfun$4.apply(GCN.scala:54) at com.tencent.angel.pytorch.graph.gcn.GCN$$anonfun$4.apply(GCN.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

modify CMakeList

Add some parameters to CMakeLists, in order to support GAT, HGAT, such as require torch version >=1.4.0, link torch_scatter library when use GAT, HGAT algorithm.

Integrated DGI algorithm

Integrate one-hop and two-hop for DGI algorithm

[New Feature] support dssm algorithm

support dssm algorithm on pytorch on angel

2021Tencent Rhino-bird Open-source Training Program—Angel Zhi Shen

腾讯犀牛鸟实战-Angel平台搭建和例程运行

关于运行平台

 1. 平台: AT平台的虚拟机都可以，另外一个云不行！！！
    实测另外一个平台搭建过程会报其他错，可能局域网有些其他设置或者hostname有问题吧。
 2. 编译方式: 本地编译，伪分布式配置，系统centOS 7.2。
 3. gcc: 7.3版本即可, cmake 3.21版本配置libtorch时候会报warning不知道会不会有问题，我后面换成3.12跑通的。
参考网页：
 centOS下gcc的版本升级：https://blog.csdn.net/ncdx111/article/details/106047228
 cmake下载安装：https://blog.csdn.net/weixin_30781433/article/details/98787965?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.base&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-1.base

关于hadoop，spark和pytorch版本

1. hadoop: 版本选2.7.x即可，2.7.1和2.7.5亲测可用。
2. spark: 之前群里有人测过这里spark2.3.0是必须的, 2.4.0版本会报错。
3. pytorch: pytorch版本为1.3.1，torchvision为0.4.2。这里pytorch我理解是生成模型用的，不知道运行时候还需不需要了。
    libtorch：这里libtorch选1.3.1，pytorch官网有同版本libtorch，pre-ABI和ABI，编译时候都是对的，但有一个在运行时候会报 
    符号错误。我记得pre-ABI应该是可用的，记不太清楚了。

参考网页：
    hadoop搭建：https://blog.csdn.net/csdnmrliu/article/details/82963783（非源码编译下载后解压缩，配置环境变量即可）
    github下载加速：https://blog.csdn.net/haejwcalcv/article/details/108028245
    spark搭建：https://archive.apache.org/dist/，下载压缩包后解压缩然后配置conf文件夹后的env脚本即可

关于hadoop和spark的配置

配置之前有同学发了，可以直接用。

关于环境变量和环境变量可能导致问题

这几行到没有什么，java注意不要配错就行。

ANGEL包的地址配错，或者scla地址配错或者jar包遗漏都会报error,exit 0，具体看就是example.scala的第80行报错，也就是读取那块，我卡在这里卡了很久。

有需要可以参考一下，我不知道最后几个变量有没有用。

一些其他可能的问题

运行程序时候一直卡在accept状态:
如果不是命令行内存分配不对就是yarn给的不对，再不行就换机子，内存尽量配到30G感觉会比较好？
运行时候一直卡在RUNNING：
我后面重新配环境，这个问题就没有出现了，当时是ps一直没有启动，因为没有日志也不知道具体什么原因，如果遇到。。。自求多福吧
HDFS地址，其实hdfs不太重要，地址写不到都能从日志看到，慢慢改就好，但out的地址一定记得写到hdfs里面，不要写root！！！原因见下图：

惊不惊喜意不意外刺不刺激???跑完succeed然后把root目录整个删掉就很离谱。 4. queue提示找不到，按下面命令改。 5. 申请block失败。换云服务器2333，只在一个云服务器上遇到这个情况。

中间还有一些其他乱七八糟的问题，不太想回滚聊天记录了，应该都可以百度解决。
https://blog.csdn.net/qq_50665031/article/details/108987205 这是一个安装glbic2.23的网页，忘记做什么时候用到的了，有其他人遇到可以看一下。

###关于command命令

###运行成功截图

over.

Error occurred when maven compiling java : Could not find artifact edu.princeton.cs:algs4:jar:1.0.3

here is the maven ( mvn clean package -Dmaven.test.skip=true ) output log:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  26.028 s
[INFO] Finished at: 2022-05-05T16:24:57Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pytorch-on-angel: Could not resolve dependencies for project com.tencent.angel:pytorch-on-angel:jar:0.3.0: Could not find artifact edu.princeton.cs:algs4:jar:1.0.3 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

and this error also happended even if using Docker method recommended in Readme.md
how can I solve this problem? anyone please

请问支持angel3.0么？NoClassDefFoundError：com/tencent/angel/ml/matrix/RowType

跑demo的时候，发现

Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/ml/matrix/RowType
at com.tencent.angel.pytorch.model.TorchParams.(TorchParams.scala:29)
at com.tencent.angel.pytorch.model.ParTorchModel.init(ParTorchModel.scala:69)
at com.tencent.angel.pytorch.examples.ClusterExample$.main(ClusterExample.scala:68)
at com.tencent.angel.pytorch.examples.ClusterExample.main(ClusterExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.ml.matrix.RowType
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more

这个类在44b7c476d0c6b2aab7399f5200f64c44aa973fb5这次提交的时候修改路径了，到angel-math-0.1.0.jar里去了，请问该怎么处理？

add docs for gnn algorithms

send stop command to Master failed

可以跑通spark on angel 的lr示例
无法跑通pytorch on angel的 deepfm示例

实际运行中，发现，起一个spark进程，两个 angel ps 进程。并且有一个angel ps进程会在spark进程之前结束。然后，运行日志中提示，send stop command to Master failed

求尽快解答！

2019-09-05 15:13:06 ERROR AngelClient:480 - send stop command to Master failed
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /13.190.232.43:21029
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:317)
at com.sun.proxy.$Proxy25.stop(Unknown Source)
at com.tencent.angel.client.AngelClient.stop(AngelClient.java:477)
at com.tencent.angel.client.AngelPSClient.stopPS(AngelPSClient.java:181)
at com.tencent.angel.spark.context.AngelPSContext$.doStop(AngelPSContext.scala:441)
at com.tencent.angel.spark.context.AngelPSContext$$anon$2.run(AngelPSContext.scala:323)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /10.110.132.43:21029
at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:121)
at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:297)
at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:294)
... 6 more
Caused by: java.io.IOException: Error connecting to /10.110.132.43:21029
at com.tencent.angel.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:149)
at com.tencent.angel.ipc.NettyTransceiver.transceive(NettyTransceiver.java:338)
at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:292)
... 7 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.110.132.43:21029
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
... 11 more
2019-09-05 15:13:06 INFO YarnClientImpl:395 - Killed application application_1567663579362_0002
End of LogType:stdout

dependency issue

能否给一个教程，告诉我们要当前版本的pytorch-on-angel要搭配什么版本的angel以及其他的dependency？我反复尝试了很久，还是有各种version不匹配的问题。

请问训练数据是指支持libsvm么？如果是的话如何使用自定义dataloader呢？

[New Feature] add edge feature for gnn

add edge feature for gnn

[New Feature] support HGAT alogrithm

support HGAT alogrithm on pytorch on angel, which is a novel heterogeneous graph neural network.

DGI训练报错，使用官方文档中参数训练的时候出错

20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_3 locally
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_0 locally
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_1 locally
20/11/12 15:21:17 INFO BlockManager: Found block rdd_27_2 locally
terminate called after throwing an instance of 'c10::Error'
what(): forward_() is missing value for argument 'second_edge_index'. Declaration: forward_(ClassType self, Tensor pos_x, Tensor neg_x, Tensor first_edge_index, Tensor second_edge_index) -> ((Tensor, Tensor, Tensor)) (checkAndNormalizeInputs at /pytorch/aten/src/ATen/core/function_schema_inl.h:270)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f4953ac9813 in libc10.so)
frame #1: + 0x323bdea (0x7f4956f29dea in libtorch.so)
frame #2: torch::jit::Function::operator()(std::vector<c10::IValue, std::allocatorc10::IValue >, std::unordered_map<std::string, c10::IValue, std::hashstd::string, std::equal_tostd::string, std::allocator<std::pair<std::string const, c10::IValue> > > const&) + 0x36 (0x7f4956f280a6 in libtorch.so)
frame #3: torch::jit::script::Method::operator()(std::vector<c10::IValue, std::allocatorc10::IValue >, std::unordered_map<std::string, c10::IValue, std::hashstd::string, std::equal_tostd::string, std::allocator<std::pair<std::string const, c10::IValue> > > const&) + 0xc9 (0x7f4956ee6709 in libtorch.so)
frame #4: angel::TorchModel::forward(std::vector<c10::IValue, std::allocatorc10::IValue >) + 0xcc (0x7f4960c3c744 in /data/data3/yarn/nm2/usercache/service/filecache/39/libtorch_angel.so)
frame #5: angel::TorchModel::backward(std::vector<c10::IValue, std::allocatorc10::IValue >, at::Tensor) + 0x67 (0x7f4960c3cb77 in /data/data3/yarn/nm2/usercache/service/filecache/39/libtorch_angel.so)
frame #6: Java_com_tencent_angel_pytorch_Torch_gcnBackward + 0x33c (0x7f4960c31d70 in /data/data3/yarn/nm2/usercache/service/filecache/39/libtorch_angel.so)
frame #7: [0x7f49c49bd6c7]

2021Tencent Rhino-bird Open-source Training Program—Angel Zeng Shang

第一次作业

很荣幸入选 Angel 项目，开始开源实战环节。能够和导师们、同学们共同学习、了解 Angel 分布式机器学习平台架构设计原理是个难得的机会。以下是本次开源活动的实战笔记。因本人水平有限，错误和不足之处在所难免，敬请各位专家读者指正。

Angel 环境搭建

本次项目是基于 Angel-ML/PyTorch-On-Angel 的一个论文复现，在进行其它工作之前，我们需要部署一个可以运行的环境。

PyTorch on Angel's architecture

PyTorch-On-Angel 主要由三个模块构成：

Python Client：用于生成 ScriptModule
Angel PS：参数服务器，负责模型的分布式存储、同步和协调计算
Spark：Spark Driver、Spark Executor 负责加载 ScriptModule，数据处理，同参数服务器协同完成模型的训练和预测

厘清依赖：

由 Python 代码生成 ScriptModule，需要 python 环境和 torch 包
使用 C++ 后端，需要 libtorch_angel
Angel PS 和 Spark Driver、Spark Executor 需要 Spark
项目中推荐使用 Spark on YARN 的方式，Hadoop 也是需要的

以下操作均基于 Ubuntu 20.04 LTS ，因为自用，环境不完全干净，不保证没有别的问题。

PyTorch-On-Angel

第一步当然是：

git clone https://github.com/Angel-ML/PyTorch-On-Angel.git --depth 1

项目文档中介绍了编译方法，出于使用方便，我准备好镜像源文件，放在下 ./addon 备用：

Debian 9 sources.list ：

deb http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free

maven settings.xml ：

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <mirrors>
    <mirror>
      <id>nexus-tencentyun</id>
      <mirrorOf>*</mirrorOf>
      <name>Nexus tencentyun</name>
      <url>http://mirrors.cloud.tencent.com/nexus/repository/maven-public/</url>
    </mirror>
  </mirrors>
</settings>

修改了 Dockerfile ：

########################################################################################################################
#                                                       DEV                                                            #
########################################################################################################################
FROM maven:3.6.1-jdk-8 as DEV

##########################
#  install dependencies  #
##########################
COPY ./addon/sources.list /etc/apt/sources.list
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    curl=7.52.1-5+deb9u9 \
    g++=4:6.3.0-4 \
    make=4.1-9.1 \
    unzip=6.0-21+deb9u1 \
    python3 \
    python3-pip \
    python3-setuptools \
    python3-wheel \
    && rm -rf /var/lib/apt/lists/*

#####################
#  Install PyTorch  #
#####################
RUN python3 -m pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

#######################
#  install new cmake  #
#######################
RUN curl -fsSL --insecure -o /tmp/cmake.tar.gz https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    && tar -xzf /tmp/cmake.tar.gz -C /tmp \
    && rm -rf /tmp/cmake.tar.gz  \
    && mv /tmp/cmake-* /tmp/cmake \
    && cd /tmp/cmake \
    && ./bootstrap \
    && make -j8 \
    && make install \
    && rm -rf /tmp/cmake

#######################
#  download libtorch  #
#######################
WORKDIR /opt
RUN curl -fsSL --insecure -o libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-wit \
    && unzip -q libtorch.zip \
    && rm libtorch.zip

ENV TORCH_HOME=/opt/libtorch

########################################################################################################################
#                                                     JAVA BUILDER                                                     #
########################################################################################################################
FROM DEV as JAVA_BUILDER

COPY ./addon/settings.xml /usr/share/maven/conf/

WORKDIR /app

COPY ./java/pom.xml /app

RUN mvn -e -B dependency:resolve dependency:resolve-plugins

COPY ./java /app

RUN mvn -e -B -Dmaven.test.skip=true package

########################################################################################################################
#                                                     CPP BUILDER                                                      #
########################################################################################################################
FROM DEV as CPP_BUILDER

RUN apt-get update  \
    && apt-get install -y --no-install-recommends \
    zip=3.0-11+b1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY ./cpp ./

RUN ./build.sh \
    && cp ./out/*.so "$TORCH_HOME"/lib \
    && cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib \
    && ln -s "$TORCH_HOME"/lib torch-lib \
    && zip -qr /torch.zip torch-lib

########################################################################################################################
#                                                       Artifacts                                                      #
########################################################################################################################
FROM alpine:3.10 as ARTIFACTS

WORKDIR /dist
COPY --from=CPP_BUILDER /torch.zip ./
COPY --from=JAVA_BUILDER /app/target/*.jar ./

VOLUME /output

CMD [ "/bin/sh", "-c", "cp ./* /output" ]

修改 cpp/CMakeList.txt ：

set(TORCH_HOME $ENV{TORCH_HOME})

执行 build.sh 静待片刻：

./build.sh

如果下载安装缓慢也可以提前在 addon 下准备好需要的文件并修改 Dockerfile 里相应部分：

cd addon && wget https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

修改 gen_pt_model.sh python → python3：

docker run -it --rm -v $(pwd)/${MODEL_PATH}:/model.py -v $(pwd)/dist:/output -w /output ${IMAGE_NAME} python3 /model.py ${@:2}

./dist 下就有了我们所需要的文件：

deepfm.pt  pytorch-on-angel-0.2.0.jar  pytorch-on-angel-0.2.0-jar-with-dependencies.jar  torch.zip

第一步就完成了~

Hadoop

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

强迫症表示看到很多没用的文件就想删掉：

find . -name *.cmd | xargs rm

修改配置文件：

hadoop-env.sh

export JAVA_HOME="按情况修改"

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

到这里 HDFS 就设置完了，format 一下：

hdfs namenode –format

启动试试是否正常工作：启动需要能 SSH master worker，SSH 设置这里就略了

./start-dfs.sh
jps
# 105141 DataNode
# 104964 NameNode
# 105385 SecondaryNameNode
# 都有就是正常啦，没有的看看日志排查

mapred-site.xml 运行方式改成 yarn

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml yarn 的资源配置，默认是 8G ，跑 Angel 可能不够，根据自身电脑配置修改：

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>30720</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>30720</value>
  </property>
</configuration>

启动试试是否正常工作：

./start-yarn.sh
jps
# 107761 ResourceManager
# 108141 NodeManager
# 都有就是正常啦，没有的看看日志排查

Spark

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

配置好 Hadoop 之后 Spark 的配置就比较简单了，Spark on YARN 可以直接从 Hadoop 的配置里读取，只需要修改：

spark-env.sh

export HADOOP_CONF_DIR="按情况修改"

启动试试是否正常工作：

./start-all.sh
jps
# 2273766 Worker
# 2273463 Master
# 都有就是正常啦，没有的看看日志排查

Angel

注意 jdk 版本，不然后续会报错

sudo apt install openjdk-8-jdk -y
sudo apt install maven -y

编译安装 protobuf 2.5.0 ，依照 README.txt 即可，记得最后要 ldconfig :

wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

按照说明编译即可：

wget https://github.com/Angel-ML/angel/archive/refs/tags/Release-2.4.0.tar.gz

编译完成后解压，进行配置：

spark-on-angel-env.sh

export SPARK_HOME="按情况修改"
export ANGEL_HOME="按情况修改"
export ANGEL_HDFS_HOME="按情况修改"
export ANGEL_VERSION=2.4.0

# 部分 jar 包版本问题
angel_ps_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar
sona_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar,json4s-jackson_2.11-3.2.11.jar,json4s-ast_2.11-3.2.11.jar,json4s-core_2.11-3.2.11.jar

创建文件夹，把需要的文件放上 HDFS 备用

hdfs dfs -mkdir /angel
hdfs dfs -put ./angel/data/census/census_148d_train.libsvm /angel
hdfs dfs -put ./angel/lib /angel

把之前生成好的四个文件放在合适的位置：

torch.zip pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar deepfm.pt

spark-submit 配置参数按实际情况修改：

因为--archives torch.zip#torch 在我这一直不起作用，搜寻资料也没有结果，于是我解压了 torch.zip，选择用 —-files 上传：

#!/bin/bash
JAVA_LIBRARY_PATH="按情况修改"
source ./angel/bin/spark-on-angel-env.sh
input="按情况修改"
output="按情况修改"
torchlib=torch-lib/libpthreadpool.a,torch-lib/libcpuinfo_internals.a,torch-lib/libCaffe2_perfkernels_avx2.a,torch-lib/libgmock.a,torch-lib/libprotoc.a,torch-lib/libnnpack.a,torch-lib/libgtest.a,torch-lib/libpytorch_qnnpack.a,torch-lib/libcaffe2_detectron_ops.so,torch-lib/libCaffe2_perfkernels_avx512.a,torch-lib/libgomp-753e6e92.so.1,torch-lib/libgloo.a,torch-lib/libonnx.a,torch-lib/libtorch_angel.so,torch-lib/libbenchmark_main.a,torch-lib/libcaffe2_protos.a,torch-lib/libgtest_main.a,torch-lib/libprotobuf-lite.a,torch-lib/libasmjit.a,torch-lib/libCaffe2_perfkernels_avx.a,torch-lib/libonnx_proto.a,torch-lib/libfoxi_loader.a,torch-lib/libfbgemm.a,torch-lib/libc10.so,torch-lib/libclog.a,torch-lib/libbenchmark.a,torch-lib/libgmock_main.a,torch-lib/libnnpack_reference_layers.a,torch-lib/libcaffe2_module_test_dynamic.so,torch-lib/libqnnpack.a,torch-lib/libprotobuf.a,torch-lib/libc10d.a,torch-lib/libtorch.so,torch-lib/libcpuinfo.a,torch-lib/libstdc++.so.6,torch-lib/libmkldnn.a

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.ps.instances=1 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.memory=5g \
    --conf spark.ps.log.level=INFO \
    --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraLibraryPath=. \
    --conf spark.driver.extraLibraryPath=. \
    --conf spark.executorEnv.OMP_NUM_THREADS=2 \
    --conf spark.executorEnv.MKL_NUM_THREADS=2 \
    --name "deepfm for torch on angel" \
    --jars $SONA_SPARK_JARS \
    --files deepfm.pt,$torchlib \
    --driver-memory 5g \
    --num-executors 1 \
    --executor-cores 1 \
    --executor-memory 5g \
    --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.2.0.jar \
    trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
    stepSize:0.001 numEpoch:10 testRatio:0.1 \
    angelModelOutputPath:$output

去 http://master:8088/cluster/apps 上收获成功吧！

CMake error: add_subdirectory given source "pytorch_scatter-2.0.5" which is not an existing directory

hey, guys, I got a error at make stage

-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 6.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found torch: /opt/libtorch/lib/libtorch.so
-- FOUND TORCH VERSION : 1.5.0
-- Found JNI: /usr/local/openjdk-8/jre/lib/amd64/libjawt.so
CMake Error at CMakeLists.txt:38 (add_subdirectory):
add_subdirectory given source "pytorch_scatter-2.0.5" which is not an
existing directory.

-- Configuring incomplete, errors occurred!
See also "/app/out/CMakeFiles/CMakeOutput.log".
See also "/app/out/CMakeFiles/CMakeError.log".
make: *** No targets specified and no makefile found. Stop.
The command '/bin/sh -c ./build.sh && cp ./out/*.so "$TORCH_HOME"/lib && cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib && ln -s "$TORCH_HOME"/lib torch-lib && zip -qr /torch.zip torch-lib' returned a non-zero code: 2

编译问题

可以直接提供一个编译好的包吗，使用者不用在单独编译了，这样对用户来说非常友好，目前本地编译的时候各种各样的问题，浪费很多时间和精力，按照网上其他人的教程也有各种问题，比如包版本不存在，源不存在等等，非常非常耗费时间，把大量时间浪费第一步编译上的不值得的，麻烦看一下是否支持，非常感谢！

请问如何在GCN训练中设置使用GPU？

rename input parameters for edgeprop

[Bug]fix bug about softmax

fix bug about softmax

报错：NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize

你好，问一下？
我在测试使用PyTorch-On-Angel ，提交报错了，在代码里找到的是com.tencent.angel.graph.utils.params.HasBatchSize，会是什么原因导致的呢？

param mode = yarn-client Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:763) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) at java.net.URLClassLoader.access$100(URLClassLoader.java:73) at java.net.URLClassLoader$1.run(URLClassLoader.java:368) at java.net.URLClassLoader$1.run(URLClassLoader.java:362) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:361) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:53) at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: com.tencent.angel.spark.ml.graph.params.HasBatchSize at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 25 more

pytorch on angel跑在yarn上报错User class threw exception: java.lang.UnsatisfiedLinkError: /data/hadoop/yarn/local/usercache/hdfs/filecache/5907/libtorch_angel.so: libtorchscatter.so

pytorch on angel分支0.3.0
angel版本3.2.0
提交参数
${SPARK_HOME}/bin/spark-submit
--master yarn
--deploy-mode cluster
--queue default
--conf spark.ps.instances=1
--conf spark.ps.cores=1
--conf spark.ps.jars=$SONA_ANGEL_JARS
--conf spark.ps.memory=4g
--conf spark.ps.log.level=INFO
--conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.
--conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.
--conf spark.executor.extraLibraryPath=.
--conf spark.driver.extraLibraryPath=.
--conf spark.executorEnv.OMP_NUM_THREADS=2
--conf spark.executorEnv.MKL_NUM_THREADS=2
--name "deepfm for torch on angel"
--jars $SONA_SPARK_JARS
--files deepfm.pt,$torchlib
--driver-memory 4g
--num-executors 1
--executor-cores 1
--executor-memory 4g
--class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.3.0.jar
trainInput:$input batchSize:128 torchModelPath:deepfm.pt
stepSize:0.001 numEpoch:10 testRatio:0.1
angelModelOutputPath:$output

模型训练报错

User class threw exception: java.lang.UnsatisfiedLinkError: no torch_angel in java.library.path
java.lang.UnsatisfiedLinkError: no torch_angel in java.library.path

does pytorch on angel just support on cpu cluster?

[New Feature] abstract convolution operator for graphsage

abstract convolution operator for graphsage

[New Feature] support bipartite gnn

support bipartite gnn on pytorch on angel

add esmm

add esmm model

spark-on-angel-env.sh not found

where is the spark-on-angel-env.sh in README.md

[New Feature] support edgeProp alogrithm

support edgeProp algorithm on pytorch on angel, which is an end-to-end Graph Convolution Network (GCN)-based algorithm to learn the embeddings of the nodes and edges of a large-scale time-evolving graph

[New Feature] support multi-label-classification and class-weight for rgcn

support multi-label-classification and class-weight for rgcn

[New Feature] add feature importance for deep&wide

add feature importance for deep&wide

make failed

I follow the instruction[Compilation & Deployment Instructions] to compile and deploy on my mac. But the following error occurred.

Seems "jni_md.h" is not found on the specified directory. How can deal with it?

[New Feature] support GAT algorithm

support GAT algorithm on pytorch on angel

[New Feature] support high-sparse feature for gnn

support high-sparse feature for gnn on pytorch on angel

make failed at: error: ‘as_variable_ref’ is not a member of ‘torch::autograd’ return torch::autograd

hey guys, I've encounted a promlem at make statge

PyTorch-On-Angel/cpp/include/angel/pytorch/model.h:71:33: error: ‘as_variable_ref’ is not a member of ‘torch::autograd’
return torch::autograd::as_variable_ref(p.value);
....
Cmake[2]: *** [CMakeFiles/torch_angel.dir/src/angel/pytorch/angel_torch.cc.o] Interrupt
make[2]: *** [CMakeFiles/torch_angel.dir/src/angel/pytorch/model.cc.o] Interrupt
make[1]: *** [CMakeFiles/torch_angel.dir/all] Interrupt

my env is:
python 3.7
jdk 1.8
maven 3.6.2
gcc 7.1.0
libtorch 1.9.0
pytorch 1.9.0
cmake 3.12.2
cuda 10.2

is anyone knows how to solve it? thanks a lot!

add sgcn

2021Tencent Rhino-bird Open-source Training Program—Angel--刘倩

一、 angel 算法案例

1.1 LR-spark-on-angel输出

1.2 Debug

1. netty-all-4.1.1.Final.jar与json4s-jackson_2.11-3.4.2.jar版本问题

修改angel-ps与spark-on-angel的pom文件改为以上版本

2. 跑通项目的软件版本

apache-maven-3.8.1
hadoop-2.7.2
jdk1.8.0_161
protobuf-2.5.0
scala-2.11.8
spark-2.3.0-bin-hadoop2.7
angel-2.4.0-bin

二、 Pytorch on angel 算法案例

1.1 deepfm for torch on angel输出

http://hadoop001:8088/cluster/apps

1.2 Debug

1. cmake报错

在dockerfile里面添加
ENV Torch_DIR=/opt/libtorch/share/cmake/Torch

2. pytorch版本和torchvision版本不对应

在dokerfile文件里面添加torchvision=0.4.2

3. spark-submit提交脚本

source /home/liuqian/angel/angel/dist/target/angel-2.4.0-bin/bin/spark-on-angel-env.sh

4.内存问题

把yarn.scheduler.capacity.maximum-am-resource-percent调到0.6

5.提交脚本内存分配不合理

ps log

换一台物理内存大的机器，重新配置跟之前一样的环境，yarn设置和提交脚本如下：

2021Tencent Rhino-bird Open-source Training Program—Angel YuZhengze

TODO

[ASK] Issue ingesting libffm format for xDeepFM

I'm currently attempting to train xDeepFM using the libffm format.
I noticed in the code that load_svmlight_file from sklearn is used to load the text file.
While load_svmlight_file works well for to load data in the libsvm format -

label feature1:value1 feature2:value2

With the libffm format

label field1:feature1:value1 field2:feature2:value2

I get the error

ValueError: could not convert string to float: b'0:0.11545'

Since load_svmlight_file ultimately converts to a sparse matrix, I can also convert my data directly to a sparse matrix. However, while it is obvious what the libsvm format would look like in matrix format, it isn't obvious what the libffm format would look like.

Has anyone successfully trained a xDeepFM using this repo? Please help!

User class threw exception: java.lang.AbstractMethodError

21/02/04 15:13:00 ERROR ApplicationMaster: User class threw exception: java.lang.AbstractMethodError: com.tencent.angel.pytorch.graph.gcn.GCN.org$apache$spark$ml$param$Params$setter$paramMap_$eq(Lorg/apache/spark/ml/param/ParamMap;)V
java.lang.AbstractMethodError: com.tencent.angel.pytorch.graph.gcn.GCN.org$apache$spark$ml$param$Params$setter$paramMap_$eq(Lorg/apache/spark/ml/param/ParamMap;)V
at org.apache.spark.ml.param.Params$class.$init$(params.scala:868)
at com.tencent.angel.pytorch.graph.gcn.GNN.(GNN.scala:39)
at com.tencent.angel.pytorch.graph.gcn.GNN.(GNN.scala:46)

angel-ml / pytorch-on-angel Goto Github PK

pytorch-on-angel's People

Contributors

Stargazers

Watchers

Forkers

pytorch-on-angel's Issues

腾讯犀牛鸟实战-Angel平台搭建和例程运行

关于运行平台

关于hadoop，spark和pytorch版本

关于hadoop和spark的配置

关于环境变量和环境变量可能导致问题

一些其他可能的问题

第一次作业

Angel 环境搭建

PyTorch-On-Angel

Hadoop

Spark

Angel

一、 angel 算法案例

1.1 LR-spark-on-angel输出

1.2 Debug

1. netty-all-4.1.1.Final.jar与json4s-jackson_2.11-3.4.2.jar版本问题

2. 跑通项目的软件版本

二、 Pytorch on angel 算法案例

1.1 deepfm for torch on angel输出

1.2 Debug

1. cmake报错

2. pytorch版本和torchvision版本不对应

3. spark-submit提交脚本

4.内存问题

5.提交脚本内存分配不合理

TODO

Recommend Projects

Recommend Topics

Recommend Org