Sharing another write-up: building TensorFlowOnSpark on a Hadoop distributed cluster in YARN mode.

Official guide: https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN. That page is far too brief and leaves a great deal unexplained, so beginners end up taking many detours; perhaps because the project was open-sourced only recently, much of the guidance simply is not there yet. In short, the official Instructions are rough, and whether the build succeeds can feel like a matter of luck. The steps below are the ones that worked for me in practice.

1 Environment preparation

Prerequisite: a Hadoop distributed cluster with Spark installed (running Ubuntu Server), with Spark already started. My cluster environment was shown in a table here:

(Table: Hadoop distributed cluster environment with Spark installed)
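Before going further, it is worth confirming that HDFS, YARN, and Spark are actually up on every node, since all the job submissions later assume a healthy cluster. A minimal sanity check, assuming the Hadoop and Spark binaries are already on your PATH:

[mw_shl_code=bash,true]# On the master you should see NameNode / ResourceManager / Master;
# on the workers, DataNode / NodeManager / Worker
jps
# HDFS should report its live datanodes
hdfs dfsadmin -report | grep "Live datanodes"
# YARN should list the running node managers
yarn node -list
# Spark should print its version without errors
spark-submit --version[/mw_shl_code]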
2 On the master node:

(1) Install Python 2.7

This step downloads and installs Python into a local folder. The point is that, when tasks are distributed, this Python and its dependency environment (here including TensorFlow) can be shipped together to each Spark executor, so this step is not simply installing Python.

[mw_shl_code=bash,true]# Download and unpack Python 2.7
export PYTHON_ROOT=~/Python
curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvf Python-2.7.12.tgz
rm Python-2.7.12.tgz

# Before compiling Python, do the following, otherwise the resulting python
# will be missing the zlib and SSL modules, among other errors.
# Install the ruby, zlib, and ssl related packages
sudo apt install ruby
sudo apt install zlib1g zlib1g-dev
sudo apt install libssl-dev

# Enter the unpacked Python directory and edit Modules/Setup.dist, the file
# from which Python's build configuration is generated
cd Python-2.7.12
sudo vim Modules/Setup.dist
# Uncomment the ssl and zlib related lines:
# ssl related:
#SSL=/usr/local/ssl
#_ssl _ssl.c \
#       -DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
#       -L$(SSL)/lib -lssl -lcrypto
# zlib related:
#zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz
cd ..

# Compile Python into the local folder ${PYTHON_ROOT}
pushd Python-2.7.12
./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4
# If ./configure complains "no acceptable C compiler", install gcc first:
# sudo apt install gcc
make
make install
popd
rm -rf Python-2.7.12

# Install pip
pushd "${PYTHON_ROOT}"
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
rm get-pip.py

# Install pydoop
${PYTHON_ROOT}/bin/pip install pydoop
# If installing pydoop fails with LocalModeNotSupported, download the source
# package and install it via setup.py instead:
# package download: https://pypi.python.org/pypi/pydoop
# tar -xvf pydoop-1.2.0.tar.gz
# cd pydoop-1.2.0
# python setup.py build
# python setup.py install

# Install TensorFlow; there is no need to build it from source
# unless you need the RDMA feature
${PYTHON_ROOT}/bin/pip install tensorflow
popd[/mw_shl_code]

(2) Install TensorFlowOnSpark

[mw_shl_code=bash,true]git clone https://github.com/yahoo/TensorFlowOnSpark.git
git clone https://github.com/yahoo/tensorflow.git
# In the cloned TensorFlowOnSpark repository, tensorflow is an empty folder
# (on GitHub it links to yahoo/tensorflow), so copy the tensorflow clone
# into TensorFlowOnSpark:
sudo rm -rf TensorFlowOnSpark/tensorflow/
sudo mv tensorflow/ TensorFlowOnSpark/
# Adjust the tensorflow and TensorFlowOnSpark paths to your own layout;
# mine sit under the root directory, hence the commands above[/mw_shl_code]

(3) Build and install the Hadoop InputFormat/OutputFormat for TFRecords

[mw_shl_code=bash,true]# Install the following build dependencies:
sudo apt install autoconf automake libtool curl make g++ unzip maven

# First install protobuf; GitHub has detailed build instructions:
# https://github.com/google/protobuf/blob/master/src/README.md
# See also: http://www.itdadao.com/articles/c15a1006495p0.html

# Compile TensorFlow's protos:
git clone https://github.com/tensorflow/ecosystem.git
cd ecosystem/hadoop
protoc --proto_path=/opt/TensorFlowOnSpark/tensorflow/ --java_out=src/main/java/ /opt/TensorFlowOnSpark/tensorflow/tensorflow/core/example/{example,feature}.proto
mvn clean package
mvn install

# Upload the jar produced by the previous step to HDFS
cd target
hadoop fs -put tensorflow-hadoop-1.0-SNAPSHOT.jar[/mw_shl_code]

(4) Prepare the Python-with-TensorFlow zip for Spark

[mw_shl_code=bash,true]pushd "${PYTHON_ROOT}"
zip -r Python.zip *
popd
# Upload the zip to HDFS
hadoop fs -put ${PYTHON_ROOT}/Python.zip[/mw_shl_code]

(5) Prepare the TensorFlowOnSpark zip for Spark

[mw_shl_code=bash,true]pushd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
popd[/mw_shl_code]

The basic environment setup is now complete.

Testing [5]

1) Prepare the data

Download and zip the MNIST dataset:

[mw_shl_code=bash,true]mkdir ${HOME}/mnist
pushd ${HOME}/mnist >/dev/null
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
zip -r mnist.zip *
popd >/dev/null[/mw_shl_code]
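Before submitting any jobs, it can save a lot of debugging time to verify that the artifacts built above are exactly where the spark-submit commands below expect them; a missing zip or jar is a common cause of opaque YARN failures. A quick verification sketch, assuming the default paths from steps (1) through (5):

[mw_shl_code=bash,true]# Python.zip must contain bin/python at its top level, because
# --archives ...#Python later exposes it to executors as Python/bin/python
unzip -l ${PYTHON_ROOT}/Python.zip | grep "bin/python$"
# The bundled interpreter should import tensorflow cleanly
${PYTHON_ROOT}/bin/python -c "import tensorflow as tf; print(tf.__version__)"
# Everything the jobs pull from HDFS should already be uploaded
hdfs dfs -ls /user/${USER}/Python.zip
hdfs dfs -ls /user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar
ls ${HOME}/TensorFlowOnSpark/tfspark.zip[/mw_shl_code]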
2) Run in feed_dict mode, as follows

[mw_shl_code=bash,true]# step 1: set environment variables
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default

# step 2: upload the files to HDFS
hdfs dfs -rm /user/${USER}/mnist/mnist.zip
hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip

# step 3: convert the image files (images) and labels to CSV files
hdfs dfs -rm -r /user/${USER}/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/csv \
--format csv

# step 4: train
hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-memory 8G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=6144 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model

# step 5: inference
hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-memory 8G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=6144 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/test/images \
--labels mnist/csv/test/labels \
--mode inference \
--model mnist_model \
--output predictions

# step 6: view the results (there may be several part files)
hdfs dfs -ls predictions
hdfs dfs -cat predictions/part-00001
hdfs dfs -cat predictions/part-00002
hdfs dfs -cat predictions/part-00003

# To watch the Spark job in a browser, use the master node's IP:
# http://192.168.0.20:8088/cluster/apps/[/mw_shl_code]
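One practical note before moving on: in cluster deploy mode the driver runs inside YARN, so Python stack traces from the steps above never reach your local terminal. If a stage fails, fetch the aggregated container logs for the application; the application ID below is a made-up placeholder, so substitute the one printed by spark-submit or shown in the 8088 web UI:

[mw_shl_code=bash,true]# List recent applications and their final status
yarn application -list -appStates ALL
# Pull all container logs for one application
# (application_1500000000000_0001 is an example ID, not a real one)
yarn logs -applicationId application_1500000000000_0001 | less[/mw_shl_code]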
3) Run in queuerunner mode, as follows

[mw_shl_code=bash,true]# step 1: set environment variables
export PYTHON_ROOT=~/Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default

# step 2: upload the files to HDFS
hdfs dfs -rm /user/${USER}/mnist/mnist.zip
hdfs dfs -rm -r /user/${USER}/mnist/tfr
hdfs dfs -put ${HOME}/MLdata/mnist/mnist.zip /user/${USER}/mnist/mnist.zip

# step 3: convert the image files (images) and labels to TFRecords
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,hdfs:///user/${USER}/mnist/mnist.zip#mnist \
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
${HOME}/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr

# step 4: train
hadoop fs -rm -r mnist_model
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/train \
--format tfr \
--mode train \
--model mnist_model

# step 5: inference
hadoop fs -rm -r predictions
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--py-files ${HOME}/TensorFlowOnSpark/tfspark.zip,${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
${HOME}/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/test \
--mode inference \
--model mnist_model \
--output predictions

# step 6: view the results (there may be several part files)
hdfs dfs -ls predictions
hdfs dfs -cat predictions/part-00001
hdfs dfs -cat predictions/part-00002
hdfs dfs -cat predictions/part-00003

# To watch the Spark job in a browser, use the master node's IP:
# http://192.168.0.20:8088/cluster/apps/[/mw_shl_code]

Reference: https://www.cnblogs.com/heimianshusheng/p/6768019.html

Author: 1010