Configuring Jupyter and PySpark on a Server

Author: 沿哲 | Published 2020-11-20 16:12

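Setting up a Jupyter Notebook service on a server
Using Jupyter on a server
Common commands for Python development
How to use findspark
Configuring Jupyter Notebook + Spark for remote login to a server
Creating a new Jupyter kernel with conda
Jupyter configuration tutorial
Tips for using Jupyter Notebook
How to use PySpark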

System overview

Server: Ubuntu 18
Local machine: Windows 10
Goal: set up a Jupyter + PySpark environment on the server that can be used from the local machine

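1. Upgrade Python 2 to Python 3

An article I followed and can confirm works (after installing python3 with apt-get, create a new link so that python points to it):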
https://www.cnblogs.com/wmr95/p/7637077.html
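
After re-linking, it is worth confirming which interpreter the python command now resolves to. A minimal check, assuming it is run with the same python command you intend to use for Jupyter (the helper name is_python3 is made up for this illustration):

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    return version_info[0] >= 3


print(sys.executable)          # path of the interpreter that is actually running
print(sys.version.split()[0])  # its version string
print(is_python3())
```

If the last line prints False, the python command still points at Python 2 and the link step needs another look.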

2. Install pip3

In step 1 I installed Python 3.6. When I later ran pip install, it failed with "command not found",
so pip had to be reinstalled:
https://blog.csdn.net/qq_36269513/article/details/80450421
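
To see whether pip is really attached to the new interpreter, one option is to ask that interpreter directly instead of relying on whichever pip is on the PATH. A small sketch (pip_version_line is a hypothetical helper name, not part of pip):

```python
import subprocess
import sys


def pip_version_line():
    """Ask the current interpreter's pip for its version; return None if pip is unavailable."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6 too)
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


print(pip_version_line())
```

The printed line also names the Python version pip belongs to, which helps catch a pip that is still tied to Python 2.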

3. Install Jupyter

pip install jupyter

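4. Set up remote access for Jupyter

Generate the config file:

jupyter notebook --generate-config

Open a Python interpreter (for example, type python or ipython in the terminal) and generate a password hash: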

from notebook.auth import passwd
passwd()
Enter password:
Verify password:
'********'
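
The string that passwd() returns is a salted hash of the password, not the password itself, which is why it is safe to paste into a config file. The sketch below illustrates the general idea with an 'algorithm:salt:digest' layout. It is an illustration only, not the notebook library's implementation, and the exact algorithm and format depend on the installed notebook version; salted_hash and check are names invented here.

```python
import hashlib
import secrets


def salted_hash(passphrase, algorithm="sha1", salt=None):
    """Build an 'algorithm:salt:digest' string from a passphrase (illustrative only)."""
    if salt is None:
        salt = secrets.token_hex(6)  # 12 hex characters of random salt
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check(hashed, passphrase):
    """Recompute the digest with the stored salt and compare it to the stored string."""
    algorithm, salt, _ = hashed.split(":", 2)
    return secrets.compare_digest(hashed, salted_hash(passphrase, algorithm, salt))


stored = salted_hash("example-password")
print(stored)
print(check(stored, "example-password"), check(stored, "wrong-password"))
```

Because the salt is random, hashing the same password twice gives different strings, yet each one still verifies against the original password.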

Note down the '********' string above (the hashed password); it is needed below.
Edit jupyter_notebook_config.py and add the following:

c.NotebookApp.ip='0.0.0.0'
c.NotebookApp.password = u'********'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888  # any free port will do
c.IPKernelApp.pylab = 'inline'
c.NotebookApp.allow_remote_access = True
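
If you would rather not edit the file by hand, the NotebookApp settings above can be generated and appended from Python. This is only a sketch: render_config and append_config are hypothetical helpers, the hash shown is a placeholder, and the demo writes to a throwaway file. The real file is normally ~/.jupyter/jupyter_notebook_config.py, which is where --generate-config puts it by default.

```python
import os
import tempfile


def render_config(password_hash, port=8888):
    """Return the NotebookApp settings from this section with the hash and port filled in."""
    lines = [
        "c.NotebookApp.ip = '0.0.0.0'",
        "c.NotebookApp.password = {!r}".format(password_hash),
        "c.NotebookApp.open_browser = False",
        "c.NotebookApp.port = {:d}".format(port),
        "c.NotebookApp.allow_remote_access = True",
    ]
    return "\n".join(lines) + "\n"


def append_config(path, text):
    """Append the rendered settings to the config file at the given path."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)


# Demo against a temporary file so nothing real is modified.
demo_path = os.path.join(tempfile.mkdtemp(), "jupyter_notebook_config.py")
append_config(demo_path, render_config("<hash printed by passwd()>", port=8888))
with open(demo_path, encoding="utf-8") as f:
    print(f.read())
```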

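Start Jupyter in a terminal on the server, then open the server's IP address plus the port set above (for example http://<server-ip>:8888) in a browser on the local machine.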
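
If the page does not load, a quick way to tell a firewall or binding problem from a browser problem is to test the TCP port from the local machine. A minimal sketch (is_port_open is a made-up helper; "127.0.0.1" is a stand-in that should be replaced with the server's address):

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Replace the host with the server's IP and the port with the one set in the config.
print(is_port_open("127.0.0.1", 8888))
```

A False result while Jupyter is running on the server usually points at the ip setting or at a firewall or security-group rule rather than at Jupyter itself.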

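5. Configure PySpark

For a more detailed setup, see https://zhuanlan.zhihu.com/p/52467451?utm_source=wechat_session
Running just the three commands below was enough for me; I did not set any environment variables.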

pip install pyspark

sudo apt-get install openjdk-8-jdk

pip install findspark
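
Before starting a Spark job, it can help to confirm that all three pieces are visible from the Python that Jupyter uses. The following sketch only checks for their presence and does not start Spark; spark_prerequisites is a name chosen for this example.

```python
import importlib.util
import shutil


def spark_prerequisites():
    """Report whether java is on the PATH and whether pyspark and findspark are importable."""
    return {
        "java": shutil.which("java") is not None,
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "findspark": importlib.util.find_spec("findspark") is not None,
    }


for name, ok in spark_prerequisites().items():
    print(name, "found" if ok else "MISSING")
```

If everything is found but import pyspark still fails inside a notebook, calling findspark.init() before the import is a common workaround, since it adds the Spark installation to sys.path.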

6. Result

I ran the example below; README.txt is just a text file I wrote myself.

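import os
import shutil
from pyspark import SparkContext

inputpath = 'README.txt'
outputpath = 'out.txt'

sc = SparkContext('local', 'wordcount')

# Read the file as an RDD of lines
lines = sc.textFile(inputpath)
# Split each line into words
words = lines.flatMap(lambda line: line.split(' '))
# Turn each word into a (word, 1) pair and sum the counts per word
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

# Print the result. collect() brings the pairs back to the driver so they show up
# in the notebook cell; counts.foreach(print) runs in the worker processes, so its
# output may end up in the terminal running Jupyter instead.
for pair in counts.collect():
    print(pair)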
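
To make clear what the flatMap / map / reduceByKey chain computes, here is the same word count in plain Python, without Spark. It is a sketch for comparison only (word_count is a name chosen here), and the sample sentences are invented.

```python
from collections import Counter


def word_count(lines):
    """Count words the same way as the Spark job: split each line on single spaces."""
    counts = Counter()
    for line in lines:                 # one element per line, like sc.textFile
        for word in line.split(' '):   # flatMap: every line yields several words
            counts[word] += 1          # map to (word, 1), then reduceByKey with +
    return dict(counts)


print(word_count(["spark makes counting easy", "counting words with spark"]))
```

Note that split(' ') produces empty strings when a line contains consecutive spaces, so both this version and the Spark job count '' as a word in that case; line.split() with no argument avoids that.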

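# Remove the output directory if it already exists (saveAsTextFile fails otherwise)
if os.path.exists(outputpath):
    shutil.rmtree(outputpath, True)

# Write the counts to the output path
counts.saveAsTextFile(outputpath)

After this runs, out.txt is a directory rather than a single file, and it contains two files, typically a part-00000 data file and a _SUCCESS marker.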
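
Since saveAsTextFile writes a directory of part files rather than one file, reading the result back takes a small loop. The sketch below assumes each line is the text form of a (word, count) tuple, which is what PySpark writes for an RDD of tuples; read_counts is a hypothetical helper.

```python
import ast
import glob
import os


def read_counts(output_dir):
    """Collect the (word, count) pairs from every part-* file in a Spark output directory."""
    counts = {}
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(part, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    word, n = ast.literal_eval(line)  # parse "('word', 3)" safely
                    counts[word] = n
    return counts


# Only runs when the word count above has already produced its output directory.
if os.path.isdir("out.txt"):
    print(read_counts("out.txt"))
```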