Today is Wednesday.
Got to the office before 9:00 and spent the time on topics I'm interested in.
Organized the content of the PPT a former colleague shared so I can understand it properly: TensorFlow on Spark.
一、Why TensorFlowOnSpark at Yahoo
1.Major contributor to the open-source Hadoop ecosystem
(1)Originator of Hadoop (2006)
(2)An early adopter of Spark (since 2013)
(3)Open-sourced CaffeOnSpark (2016)
a.CaffeOnSpark Update: Recent Enhancements and Use Cases
b.Wednesday @12:20 by Mridul Jain & Jun Shi
2.Large investment in production clusters
(1) Tens of clusters
(2)Thousands of nodes per cluster
3.Massive amounts of data
(1)Petabytes of data
二、Why TensorFlowOnSpark?
Machine learning at scale? One stack on the cluster:
(1)TensorFlowOnSpark and CaffeOnSpark for deep learning
(2)MLlib for non-deep learning
(3)Hive or Spark SQL for data analysis
all of it running on Spark over Hadoop datasets.
Figure 2: TensorFlowOnSpark for deep learning on Spark clusters
三、TensorFlowOnSpark Design Goals
1.Scale up existing TF apps with minimal changes
2.Support all current TensorFlow functionality
(1)Synchronous / asynchronous training
(2)Model / data parallelism
(3)TensorBoard
3.Integrate with existing HDFS data pipelines and ML algorithms
e.g. Hive, Spark, MLlib
四、TensorFlowOnSpark
1.PySpark wrapper of TF app code
2.Launches distributed TF clusters using Spark executors
3.Supports TF data ingestion modes (see the sketch after this list)
(1)feed_dict --- RDD.mapPartitions()
(2)queue_runner --- direct HDFS access from TF
4.Supports TensorBoard during/after training
5.Generally agnostic to Spark/TF versions
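For InputMode.SPARK, the worker-side pattern, as far as I can tell from the TensorFlowOnSpark examples, looks roughly like the sketch below (the model code is elided and batch handling is simplified; details should be checked against the library version in use):

def map_fn(args, ctx):
    # Runs once on every Spark executor; ctx describes this node's TF role
    from tensorflowonspark import TFNode

    cluster, server = TFNode.start_cluster_server(ctx)
    if ctx.job_name == "ps":
        server.join()                        # parameter server: just serve variables
    else:
        # ... build the model graph here (elided) ...
        tf_feed = TFNode.DataFeed(ctx.mgr)   # records pushed in from RDD partitions
        while not tf_feed.should_stop():
            batch = tf_feed.next_batch(args.batch_size)
            # ... run one training step on `batch` (elided) ...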
五、API Example
cluster = TFCluster.run(sc, map_fn, args, num_executors, num_ps, tensorboard, input_mode)
cluster.train(dataRDD, num_epochs=0)
cluster.inference(dataRDD)
cluster.shutdown()
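On the driver side the pieces fit together roughly as below, a minimal PySpark sketch assuming the map_fn from section 四; the HDFS path and numbers are made up, and the keyword names follow the slide but should be verified against the installed TensorFlowOnSpark version:

import argparse
from pyspark import SparkContext
from tensorflowonspark import TFCluster

sc = SparkContext(appName="tfos_example")

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=100)
args = parser.parse_args()

dataRDD = sc.textFile("hdfs:///path/to/train/data")   # made-up path

# One TF node per Spark executor: num_ps parameter servers, the rest workers
cluster = TFCluster.run(sc, map_fn, args, num_executors=4, num_ps=1,
                        tensorboard=True, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD, num_epochs=1)    # feed RDD partitions to the TF workers
cluster.shutdown()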
六、Failure Recovery
1.TF checkpoints written to HDFS (see the sketch after this list)
2.InputMode.SPARK
(1)TF worker runs in background
(2)RDD data-feeding tasks can be retried
(3)However, TF worker failures will be 'hidden' from Spark
3.InputMode.TENSORFLOW
(1)TF worker runs in foreground
(2)TF worker failures will be retried as Spark tasks
(3)TF worker restores from checkpoint
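The restore-from-checkpoint part is plain TensorFlow rather than anything TensorFlowOnSpark-specific. Inside the worker branch of the map_fn sketched in section 四 it might look like this, assuming TF 1.x built with HDFS support; the checkpoint path is made up:

import tensorflow as tf

# On a restart after a failure, MonitoredTrainingSession automatically
# restores the newest checkpoint found in checkpoint_dir.
ckpt_dir = "hdfs:///user/me/model_ckpt"   # made-up HDFS path
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(ctx.task_index == 0),
                                       checkpoint_dir=ckpt_dir) as sess:
    while not sess.should_stop():
        pass  # ... training steps (elided) ...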
七、What's New?
1.Community contributions
CDH compatibility, TFNode.DataFeed, bug fixes
2.RDMA merged into the TensorFlow repository
3.Registration server
4.Spark streaming
5.Pip packaging
Today: wrapping up testing on two projects, plus one more project.
After 9:00 AM:
1.Tried to reproduce whether the SMS send-time is stored in the correct format under high-concurrency stress testing; no luck, could not reproduce it.
2.With high concurrency at 4000 SMS requests, a lot of requests failed toward the end due to connection timeouts. I don't really understand the concept of the ramp-up period (in seconds) yet??? (see the note after the screenshot)
Screenshot from the web:
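A note to myself on ramp-up, to be verified: in JMeter the ramp-up period is the time over which the thread group starts all of its threads, so the load builds up gradually instead of every connection opening at once. A toy Python sketch of just the concept (this is not JMeter itself, and the numbers are made up):

import threading
import time

NUM_THREADS = 4000       # simulated users (made-up number)
RAMP_UP_SECONDS = 100    # window over which the threads are started

def send_sms():
    pass                 # placeholder for a single SMS request

# Ramp-up spreads the thread starts evenly across the window:
# 4000 threads / 100 s = 40 new threads per second, not 4000 at once.
interval = RAMP_UP_SECONDS / float(NUM_THREADS)
for _ in range(NUM_THREADS):
    threading.Thread(target=send_sms).start()
    time.sleep(interval)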
3.Started testing the new project in the afternoon.
The "Shake" feature. The environments I know of so far: development, testing, pre-release (staging), and production.
The developers had already verified it in the development environment and asked QA to start testing. (1)A bug I filed was rejected by the developers, on the grounds that we had tested without clearing the cache (the developers had not pointed out anything to watch for, such as clearing the Redis cache). (2)The front end reported "not found" and we couldn't find where the logs were stored, so we asked a developer for help; the developer said this is your own test database (the password isn't shared), maintain it yourselves.....
A bumpy process, but in the end he did come over to help locate the problem, while also complaining about my Xshell.....
Problem solved: the port number was wrong.
4.Worked overtime in the evening to keep testing; the dev lead said it can't be tested for now, and the environment setup will continue tomorrow.....
5.On the way home after work, listened to Laoxu's Linux sharing session; all the commands covered were ones I've used before:
df -h    du -sh *
cat a.txt >> b.txt    tail -f cat.log
ps -ef | grep jenkins    netstat -anlp | grep 8088    kill -9 <PID>    etc.
Tomorrow, look into the homework Laoxu assigned.