Today is Wednesday.
Got to the office before 9:00 and spent the time on topics I'm interested in.
Organized the content of the PPT a former colleague shared so I can understand it properly: TensorFlow on Spark.
一、Why TensorFlowOnSpark at Yahoo
1.Major contributor to the open-source Hadoop ecosystem
(1)Originator of Hadoop (2006)
(2)An early adopter of Spark (since 2013)
(3)Open-sourced CaffeOnSpark (2016)
a.CaffeOnSpark Update: Recent Enhancements and Use Cases
b.Wednesday @12:20 by Mridul Jain & Jun Shi
2.Large investment in production clusters
(1) Tens of clusters
(2)Thousands of nodes per cluster
3.Massive amounts of data
(1)Petabytes of data
二、Why TensorFlowOnSpark?
Machine learning at scale? One stack on the cluster:
(1)TensorFlowOnSpark and CaffeOnSpark for deep learning
(2)MLlib for non-deep learning
(3)Hive or Spark SQL for data analysis
all of it running on Spark over Hadoop datasets.
Figure 2: TensorFlowOnSpark for deep learning on Spark clusters
三、TensorFlowOnSpark Design Goals
1.Scale up existing TF apps with minimal changes
2.Support all current TensorFlow functionality
(1)Synchronous / asynchronous training
(2)Model / data parallelism
(3)TensorBoard
3.Integrate with existing HDFS data pipelines and ML algorithms
e.g. Hive, Spark, MLlib
四、TensorFlowOnSpark
1.PySpark wrapper of TF app code
2.Launches distributed TF clusters using Spark executors
3.Supports TF data ingestion modes (see the sketch after this list)
(1)feed_dict --- RDD.mapPartitions()
(2)queue_runner --- direct HDFS access from TF
4.Supports TensorBoard during/after training
5.Generally agnostic to Spark/TF versions
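For InputMode.SPARK, the worker-side pattern, as far as I can tell from the TensorFlowOnSpark examples, looks roughly like the sketch below (the model code is elided and batch handling is simplified; details should be checked against the library version in use):

def map_fn(args, ctx):
    # Runs once on every Spark executor; ctx describes this node's TF role
    from tensorflowonspark import TFNode

    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == "ps":
        server.join()                        # parameter server: just serve variables
    else:
        # ... build the model graph here (elided) ...
        tf_feed = TFNode.DataFeed(ctx.mgr)   # records pushed in from RDD partitions
        while not tf_feed.should_stop():
            batch = tf_feed.next_batch(args.batch_size)
            # ... run one training step on `batch` (elided) ...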
五、API Example
cluster = TFCluster.run(sc, map_fn, args, num_executors, num_ps, tensorboard, input_mode)
cluster.train(dataRDD, num_epochs=0)
cluster.inference(dataRDD)
cluster.shutdown()
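On the driver side the pieces fit together roughly as below, a minimal PySpark sketch assuming the map_fn from section 四; the HDFS path and numbers are made up, and the keyword names follow the slide but should be verified against the installed TensorFlowOnSpark version:

import argparse
from pyspark import SparkContext
from tensorflowonspark import TFCluster

sc = SparkContext(appName="tfos_example")

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=100)
args = parser.parse_args()

dataRDD = sc.textFile("hdfs:///path/to/train/data")   # made-up path

# One TF node per Spark executor: num_ps parameter servers, the rest workers
cluster = TFCluster.run(sc, map_fn, args, num_executors=4, num_ps=1,
                        tensorboard=True, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD, num_epochs=1)    # feed RDD partitions to the TF workers
cluster.shutdown()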
六、Failure Recovery
1.TF checkpoints written to HDFS (see the sketch after this list)
2.InputMode.SPARK
(1)TF worker runs in background
(2)RDD data-feeding tasks can be retried
(3)However, TF worker failures will be 'hidden' from Spark
3.InputMode.TENSORFLOW
(1)TF worker runs in foreground
(2)TF worker failures will be retried as Spark tasks
(3)TF worker restores from checkpoint
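The restore-from-checkpoint part is plain TensorFlow rather than anything TensorFlowOnSpark-specific. Inside the worker branch of the map_fn sketched in section 四 it might look like this, assuming TF 1.x built with HDFS support; the checkpoint path is made up:

import tensorflow as tf

# On a restart after a failure, MonitoredTrainingSession automatically
# restores the newest checkpoint found in checkpoint_dir.
ckpt_dir = "hdfs:///user/me/model_ckpt"   # made-up HDFS path
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(ctx.task_index == 0),
                                       checkpoint_dir=ckpt_dir) as sess:
    while not sess.should_stop():
        pass  # ... training steps (elided) ...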
七、What's New?
1.Community contributions
CDH compatibility, TFNode.DataFeed, bug fixes
2.RDMA merged into the TensorFlow repository
3.Registration server
4.Spark streaming
5.Pip packaging
Today: wrapping up testing on two projects, plus one more project.
After 9:00 AM:
1.Tried to reproduce whether the SMS send-time is stored in the correct format under high-concurrency stress testing; no luck, could not reproduce it.
2.With high concurrency at 4000 SMS requests, a lot of requests failed toward the end due to connection timeouts. I don't really understand the concept of the ramp-up period (in seconds) yet??? (see the note after the screenshot)
Screenshot from the web:
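A note to myself on ramp-up, to be verified: in JMeter the ramp-up period is the time over which the thread group starts all of its threads, so the load builds up gradually instead of every connection opening at once. A toy Python sketch of just the concept (this is not JMeter itself, and the numbers are made up):

import threading
import time

NUM_THREADS = 4000       # simulated users (made-up number)
RAMP_UP_SECONDS = 100    # window over which the threads are started

def send_sms():
    pass                 # placeholder for a single SMS request

# Ramp-up spreads the thread starts evenly across the window:
# 4000 threads / 100 s = 40 new threads per second, not 4000 at once.
interval = RAMP_UP_SECONDS / float(NUM_THREADS)
for _ in range(NUM_THREADS):
    threading.Thread(target=send_sms).start()
    time.sleep(interval)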
3.Started testing the new project in the afternoon.
The "Shake" feature. The environments I know of so far: development, testing, pre-release (staging), and production.
The developers had already verified it in the development environment and asked QA to start testing. (1)A bug I filed was rejected by the developers, on the grounds that we had tested without clearing the cache (the developers had not pointed out anything to watch for, such as clearing the Redis cache). (2)The front end reported "not found" and we couldn't find where the logs were stored, so we asked a developer for help; the developer said this is your own test database (the password isn't shared), maintain it yourselves.....
A bumpy process, but in the end he did come over to help locate the problem, while also complaining about my Xshell.....
Problem solved: the port number was wrong.
4.Worked overtime in the evening to keep testing; the dev lead said it can't be tested for now, and the environment setup will continue tomorrow.....
5.On the way home after work, listened to Laoxu's Linux sharing session; all the commands covered were ones I've used before:
df -h    du -sh *
cat a.txt >> b.txt    tail -f cat.log
ps -ef | grep jenkins    netstat -anlp | grep 8088    kill -9 <PID>    etc.
Tomorrow, look into the homework Laoxu assigned.