Getting and Cleaning Data: Week 1

Author: Chamberzero | Published 2021-09-28 01:24

Course GitHub repository

Week 1 contents
1. Data collection
  - Raw files (.csv,.xlsx)
  - Databases (mySQL)
  - APIs
2. Data formats
  - Flat files (.csv,.txt)
  - XML
  - JSON
3. Making data tidy
4. Distributing data
5. Scripting for data cleaning

1. Raw data vs. processed data

The difference between raw data and processed data

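2. Components of tidy data

!! Important: a data set should consist of these four parts:

  • The raw data
  • The tidy data
  • A code book (describing each variable and its values)
  • The exact steps taken to get from the raw data to the processed data (mainly scripts)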

Raw data

Identifying raw data

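Tidy data

!! Important: the four characteristics of tidy data:

  • Each variable is in its own column
  • Each different observation of a variable is in a different row
  • Each "kind" of variable is recorded in its own table
  • When there are multiple tables, a key variable links them together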
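
As a minimal sketch of the last two points: the two tables below (patients and visits, both invented for illustration) each hold one kind of information and share a key column, patient_id, so they can be joined with base R's merge(). All names and values here are assumptions, not data from the course.

```r
# Hypothetical example data: one table per "kind" of thing.
patients <- data.frame(
  patient_id = c(1, 2, 3),
  sex = c("male", "female", "female")
)

visits <- data.frame(
  patient_id = c(1, 1, 2, 3),
  visit = c(1, 2, 1, 1),
  weight_kg = c(80.5, 79.8, 62.1, 70.3)
)

# patient_id is the key variable that links the two tables.
linked <- merge(visits, patients, by = "patient_id")
print(linked)
```

Keeping the patient attributes in their own table avoids repeating them on every visit row; the join rebuilds the wide view only when it is needed.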

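Tips for tidy data

  • Put a single row of variable names at the top of each file, and nothing else
  • Make variable names as human-readable as possible
  • Save each table in its own file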

Code book

Details about the code book

How to code variables
When you put variables into a spreadsheet there are several main categories you will run into depending on their data type:

  1. Continuous
  2. Ordinal
  3. Categorical
  4. Missing
  5. Censored

Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example would be something like weight measured in kg.
Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered. This could be, for example, survey responses where the choices are: poor, fair, good.
Categorical data are data where there are multiple categories, but they aren't ordered. One example would be sex: male or female. Coding the values as these words is attractive because it is self-documenting.
Missing data are data that are unobserved and you don't know the mechanism. You should code missing values as NA.
Censored data are data where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit or a patient being lost to follow-up. They should also be coded as NA when you don't have the data, but you should also add a new column to your tidy data called "VariableNameCensored", which should have values of TRUE if censored and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report to the analyst if there is a reason you know about that some of the data are missing. You should also not impute, make up, or throw away missing observations.
In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3. This will avoid potential mix-ups about which direction effects go and will help identify coding errors.
Always encode every piece of information about your observations using text. For example, if you are storing data in Excel and use a form of colored text or cell background formatting to indicate information about an observation ("red variable entries were observed in experiment 1.") then this information will not be exported (and will be lost!) when the data is exported as raw text. Every piece of data should be encoded as actual text that can be exported.
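
The coding rules above can be sketched in R roughly as follows. The data frame, its column names (weight, weightCensored, rating, sex), and all values are hypothetical and only illustrate one way to follow the advice.

```r
# Hypothetical observations; every name and value is invented for illustration.
tidy <- data.frame(
  # Continuous: plain numbers, with NA wherever the value is not available.
  weight = c(70.2, NA, 65.0, NA),
  # Companion column for censoring: TRUE where the missing value has a known
  # mechanism (here, row 4 is assumed to be below a detection limit).
  weightCensored = c(FALSE, FALSE, FALSE, TRUE),
  # Ordinal: stored as text labels with an explicit order, not as 1, 2, 3.
  rating = factor(c("poor", "good", "fair", "good"),
                  levels = c("poor", "fair", "good"), ordered = TRUE),
  # Categorical: text labels without any order.
  sex = factor(c("male", "female", "female", "male"))
)

str(tidy)

# Ordered factors keep the direction explicit, so comparisons are meaningful.
print(tidy$rating < "good")
```

Row 2 is missing for an unknown reason, so it is NA with weightCensored set to FALSE; the code book would explain why row 4 is censored.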

Code book examples


Code book example 1

Generating a code book automatically with an R package

Describing the detailed process from raw data to tidy data

Detailed steps

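3. Downloading data

if(!file.exists("dirname")){ # check whether the directory exists
  dir.create("dirname") # create the directory
}

Download the data
download.file("url", destfile = "dirname/filename") # destfile is required; both strings are placeholders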

download.file(url, destfile, method, quiet = FALSE, mode = "w",
cacheOK = TRUE,
extra = getOption("download.file.extra"),
headers = NULL, ...)

method can be "curl" or "wget", among others (see ?download.file)
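
Putting the directory check and the download together, a minimal sketch could look like this. The helper name fetch_data, its dir and filename arguments, and the example URL are assumptions made for illustration; only file.exists(), dir.create(), and download.file() come from the notes above.

```r
# Hypothetical helper: download `url` into `dir`, creating the directory if needed.
fetch_data <- function(url, dir = "data", filename = basename(url)) {
  if (!file.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  destfile <- file.path(dir, filename)
  # mode = "wb" avoids corrupting binary files (for example .xlsx) on Windows.
  download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  # Note when the file was fetched, since remote data can change over time.
  message("Downloaded on ", date())
  destfile
}

# Usage (placeholder URL, not a real data source):
# path <- fetch_data("https://example.com/data.csv")
```

Returning the destination path lets the caller pass it straight to a reader function.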

4. Reading local files (brief)

read.csv()

read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = FALSE,
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
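
To make the relationship between these functions concrete, here is a small self-contained sketch. The CSV content is invented for illustration and is written to a temporary file so the example does not depend on any real data set.

```r
# Write a small hypothetical CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,sex,weight",
  "1,male,80.5",
  "2,female,NA",
  "3,female,62.1"
), csv_path)

# read.csv() is read.table() with header = TRUE and sep = "," already set.
from_csv <- read.csv(csv_path)
from_table <- read.table(csv_path, header = TRUE, sep = ",")

print(from_csv)
print(all.equal(from_csv, from_table))

# The string "NA" in the file is read as a missing value (na.strings = "NA").
print(is.na(from_csv$weight))
```

For large files, arguments such as nrows and colClasses can reduce reading time, at the cost of having to know the layout in advance.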

5. Reading Excel, XML, and JSON

Excel

  • read.xlsx()

XML

Use the "XML" package

xmldoc <- xmlTreeParse(url, useInternalNodes = TRUE) # download and parse the XML file
xmlRoot(xmldoc) # get the root node

XML reference # includes many excellent slide decks from Berkeley
XPath programming

XPath
Usage example
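
A minimal sketch of parsing plus XPath extraction, assuming the XML package is installed. The XML text and the helper name xpath_values are invented for illustration; the text is parsed from a string (asText = TRUE) so that no network access is needed.

```r
# Hypothetical XML document, invented for illustration.
xml_text <- '<menu>
  <food><name>Waffles</name><price>5.95</price></food>
  <food><name>Toast</name><price>4.50</price></food>
</menu>'

# Hypothetical helper: return the text of every node matched by an XPath expression.
xpath_values <- function(xml_text, xpath) {
  doc <- XML::xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)
  root <- XML::xmlRoot(doc)
  XML::xpathSApply(root, xpath, XML::xmlValue)
}

# Run the example only when the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  print(xpath_values(xml_text, "//name"))               # all <name> nodes
  print(as.numeric(xpath_values(xml_text, "//price")))  # prices as numbers
}
```

The expression "//name" selects every name node at any depth; xmlValue then extracts the text content of each match.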

HTML is handled in much the same way as XML



JSON

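
For JSON, one common option is the jsonlite package; the sketch below assumes it is installed and uses a JSON string invented for illustration rather than a real API response.

```r
# Hypothetical JSON text, invented for illustration.
json_text <- '[{"name": "Waffles", "price": 5.95}, {"name": "Toast", "price": 4.5}]'

# Run the example only when the jsonlite package is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # An array of objects with the same fields is simplified to a data frame.
  menu <- jsonlite::fromJSON(json_text)
  print(names(menu))
  print(menu$price)

  # Convert the data frame back to JSON text.
  back <- jsonlite::toJSON(menu, pretty = TRUE)
  cat(back, "\n")
}
```

fromJSON() also accepts a file path or URL instead of a string, which is how it is typically used against an API.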

6. data.table

See the package documentation directly.
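
As a brief taste before turning to that documentation, here is a minimal sketch of the DT[i, j, by] form, assuming the data.table package is installed. The table contents are invented for illustration.

```r
# Run the example only when the data.table package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Hypothetical data, invented for illustration.
  DT <- data.table(group = c("a", "b", "a", "b"), value = c(1, 2, 3, 4))

  # i: filter rows.
  print(DT[value > 1])

  # j and by: compute a summary per group.
  means <- DT[, .(mean_value = mean(value)), by = group]
  print(means)
}
```

A data.table is also a data.frame, so it can be passed to functions that expect one.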
