A Detailed Guide to the DICOM Files and XML Annotations in the LIDC-IDRI Public Lung Nodule Dataset

Author: zhwhong | Published: 2016-12-13 11:23

    Data Source

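    The dataset used is LIDC-IDRI (The Lung Image Database Consortium and Image Database Resource Initiative). It consists of chest medical image files (such as CT scans and X-rays) together with the corresponding diagnostic lesion annotations. The collection was initiated by the U.S. National Cancer Institute to support research on early cancer detection in high-risk populations.
      The dataset contains 1018 cases. The images in each case were annotated by four experienced thoracic radiologists in a two-phase process. In the first phase, each radiologist independently reviewed the scan and marked lesions in one of three categories: 1) nodules >= 3 mm, 2) nodules < 3 mm, and 3) non-nodules >= 3 mm (official wording: "nodule > or =3 mm," "nodule <3 mm," and "non-nodule > or =3 mm"; see the Summary on the official site). In the second phase, each radiologist independently reviewed the marks of the other three radiologists and gave a final opinion. This two-phase process aims to capture all findings as completely as possible without requiring forced consensus.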

    Data location: @news-ai:/baina/sda1/data/lidc/

    Parsed Output

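    1. Image matrix pixel data
      The module processes a three-dimensional matrix D of size slicers * rows * cols. The element at slice z, row y, column x of D is stored at byte offset (z * rows * cols + y * cols + x) * sizeof(data_type), where rows is the number of image rows and cols is the number of image columns (both 512 by default), and data_type is the element type (short by default). For details, see the lung nodule detection documentation.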

    • e.g., case LIDC-IDRI-0001 is a 133 * 512 * 512 matrix: 133 slices of 512 * 512 pixels each, written to a binary file in slice order, with 2 bytes per pixel (the C short type).
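    The offset formula above can be checked with a short sketch. It assumes the DAT file is a raw, headerless dump of 16-bit signed integers in little-endian byte order; the byte order is not stated in the text and is an assumption here, and `voxel_offset` and `read_voxel` are illustrative names, not part of the original tooling.

```python
import struct


def voxel_offset(z, y, x, rows=512, cols=512, itemsize=2):
    """Byte offset of element (z, y, x) in a slice-major raw volume."""
    return (z * rows * cols + y * cols + x) * itemsize


def read_voxel(path, z, y, x, rows=512, cols=512):
    """Read one 16-bit signed voxel from a raw DAT file.

    Assumes little-endian shorts ("<h"); change the format string if the
    file was written with a different byte order.
    """
    with open(path, "rb") as f:
        f.seek(voxel_offset(z, y, x, rows, cols))
        (value,) = struct.unpack("<h", f.read(2))
    return value
```

    Under these assumptions, the file for LIDC-IDRI-0001 should be 133 * 512 * 512 * 2 = 69,730,304 bytes long, which is a quick sanity check before reading any voxel.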

    2. Nodule region and type annotations
    First line: slicers rows cols data_type pixel_space_x pixel_space_y slice_thickness

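    • slicers : number of slices;
    • rows : number of matrix rows, 512 by default;
    • cols : number of matrix columns, 512 by default;
    • data_type : data type tag, one of the following enumeration values (default SHORT_TYPE, i.e. 4): enum DATA_TYPE { CHAR_TYPE, UCHAR_TYPE, INT_TYPE, UINT_TYPE, SHORT_TYPE, USHORT_TYPE, FLOAT_TYPE, DOUBLE_TYPE };
    • pixel_space_x : X-ray scan step between columns, in millimeters;
    • pixel_space_y : X-ray scan step between rows, in millimeters;
    • slice_thickness : scan step along the z axis (i.e. the slice thickness), in millimeters.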

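    Other lines: Type num x1 y1 z1 x2 y2 z2 … xi yi zi ... xn yn zn
    Type: "1" means "nodules", "2" means "small_nodules", "3" means "non_nodules";
    num: how many x, y, z numbers follow on the line (each point has three coordinates, so num is a multiple of 3);
    xi, yi, zi: the spatial coordinates of the i-th point of the nodule, where zi is the slice index;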
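    A minimal parsing sketch for the TXT layout described above. It assumes whitespace-separated fields and integer point coordinates (the text does not say whether x and y are stored as integers), and `parse_annotation` is a hypothetical helper name.

```python
# Order follows the DATA_TYPE enum quoted above, so index 4 is SHORT_TYPE.
DATA_TYPES = ["CHAR_TYPE", "UCHAR_TYPE", "INT_TYPE", "UINT_TYPE",
              "SHORT_TYPE", "USHORT_TYPE", "FLOAT_TYPE", "DOUBLE_TYPE"]
TYPE_NAMES = {1: "nodules", 2: "small_nodules", 3: "non_nodules"}


def parse_annotation(text):
    """Return (header, regions) parsed from the annotation TXT content."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    head = lines[0]
    header = {
        "slicers": int(head[0]),
        "rows": int(head[1]),
        "cols": int(head[2]),
        "data_type": DATA_TYPES[int(head[3])],
        "pixel_space_x": float(head[4]),
        "pixel_space_y": float(head[5]),
        "slice_thickness": float(head[6]),
    }
    regions = []
    for fields in lines[1:]:
        kind, num = int(fields[0]), int(fields[1])
        coords = [int(v) for v in fields[2:2 + num]]  # assumed integer indices
        if num % 3 != 0 or len(coords) != num:
            raise ValueError("num must be a multiple of 3 and match the data")
        # Group the flat list into (x, y, z) triples; z is the slice index.
        points = [tuple(coords[i:i + 3]) for i in range(0, num, 3)]
        regions.append((TYPE_NAMES[kind], points))
    return header, regions
```

    Checking that num is a multiple of 3 and matches the number of values actually present catches truncated lines early, before the coordinates are used to index the matrix.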

    Data location: @news-ai:/baina/sda1/data/lidc_matrix/ (DAT files hold the matrix, TXT files hold the annotations)

    Data Analysis

    File Structure

    So far 1012 cases have been tested. Each case folder has the following structure:
    LIDC-IDRI-XXXX / Study Instance UID / Series Instance UID / *.dcm *.xml

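    • XXXX : from 0001 to 1012;
    • Study Instance UID : the study instance identifier of each case;
    • Series Instance UID : the series instance identifier within a study;
    • *.dcm, *.xml : for parsing details, see the LIDC-IDRI image annotation processing notes.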

    Special case: LIDC-IDRI-0365 contains two series, each with its own dcm and xml files.
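    Cases with more than one series, such as LIDC-IDRI-0365, can be located with a small directory walk. This is a sketch that assumes the three-level layout shown above (case / Study Instance UID / Series Instance UID); `series_by_case` and `cases_with_multiple_series` are illustrative names.

```python
import os
from collections import defaultdict


def series_by_case(root):
    """Map each case folder to the series folders that contain .dcm files."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        if not any(name.lower().endswith(".dcm") for name in filenames):
            continue
        parts = os.path.relpath(dirpath, root).split(os.sep)
        # Under the assumed layout: parts == [case, study UID, series UID]
        if len(parts) == 3:
            found[parts[0]].append(parts[2])
    return {case: sorted(series) for case, series in found.items()}


def cases_with_multiple_series(root):
    """Return the case folders that hold more than one series."""
    return sorted(case for case, series in series_by_case(root).items()
                  if len(series) > 1)
```

    Flagging such cases up front makes it possible to pair each xml file with the dcm files of its own series instead of mixing slices from two scans.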

    **Key DICOM Fields**

    e.g., 000001.dcm in LIDC-IDRI-0001 (GE MEDICAL SYSTEMS) reads as follows (see also: common DICOM tag categories and descriptions):

    (0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
    (0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'AXIAL']
    (0008, 0016) SOP Class UID                       UI: CT Image Storage
    (0008, 0018) SOP Instance UID                    UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.143451261327128179989900675595
    (0008, 0020) Study Date                          DA: '20000101'
    (0008, 0021) Series Date                         DA: '20000101'
    (0008, 0022) Acquisition Date                    DA: '20000101'
    (0008, 0023) Content Date                        DA: '20000101'
    (0008, 0024) Overlay Date                        DA: '20000101'
    (0008, 0025) Curve Date                          DA: '20000101'
    (0008, 002a) Acquisition DateTime                DT: '20000101'
    (0008, 0030) Study Time                          TM: ''
    (0008, 0032) Acquisition Time                    TM: ''
    (0008, 0033) Content Time                        TM: ''
    (0008, 0050) Accession Number                    SH: '2819497684894126'
    (0008, 0060) Modality                            CS: 'CT'
    (0008, 0070) Manufacturer                        LO: 'GE MEDICAL SYSTEMS'
    (0008, 0090) Referring Physician Name            PN: ''
    (0008, 1090) Manufacturer Model Name             LO: 'LightSpeed Plus'
    (0008, 1155) Referenced SOP Instance UID         UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.675906998158803995297223798692
    (0010, 0010) Patient Name                        PN: ''
    (0010, 0020) Patient ID                          LO: 'LIDC-IDRI-0001'
    (0010, 0030) Patient Birth Date                  DA: ''
    (0010, 0040) Patient Sex                         CS: ''
    (0010, 1010) Patient Age                         AS: ''
    (0010, 21d0) Last Menstrual Date                 DA: '20000101'
    (0012, 0062) Patient Identity Removed            CS: 'YES'
    (0012, 0063) De-identification Method            LO: 'DCM:113100/113105/113107/113108/113109/113111'
    (0013, 0010) Private Creator                     LO: 'CTP'
    (0013, 1010) Private tag data                    LO: 'LIDC-IDRI'
    (0013, 1013) Private tag data                    LO: '62796001'
    (0018, 0010) Contrast/Bolus Agent                LO: 'IV'
    (0018, 0015) Body Part Examined                  CS: 'CHEST'
    (0018, 0022) Scan Options                        CS: 'HELICAL MODE'
    (0018, 0050) Slice Thickness                     DS: '2.500000'
    (0018, 0060) KVP                                 DS: '120'
    (0018, 0090) Data Collection Diameter            DS: '500.000000'
    (0018, 1020) Software Version(s)                 LO: 'LightSpeedApps2.4.2_H2.4M5'
    (0018, 1100) Reconstruction Diameter             DS: '360.000000'
    (0018, 1110) Distance Source to Detector         DS: '949.075012'
    (0018, 1111) Distance Source to Patient          DS: '541.000000'
    (0018, 1120) Gantry/Detector Tilt                DS: '0.000000'
    (0018, 1130) Table Height                        DS: '144.399994'
    (0018, 1140) Rotation Direction                  CS: 'CW'
    (0018, 1150) Exposure Time                       IS: '570'
    (0018, 1151) X-Ray Tube Current                  IS: '400'
    (0018, 1152) Exposure                            IS: '4684'
    (0018, 1160) Filter Type                         SH: 'BODY FILTER'
    (0018, 1170) Generator Power                     IS: '48000'
    (0018, 1190) Focal Spot(s)                       DS: '1.200000'
    (0018, 1210) Convolution Kernel                  SH: 'STANDARD'
    (0018, 5100) Patient Position                    CS: 'FFS'
    (0020, 000d) Study Instance UID                  UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178
    (0020, 000e) Series Instance UID                 UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192
    (0020, 0010) Study ID                            SH: ''
    (0020, 0011) Series Number                       IS: '3000566'
    (0020, 0013) Instance Number                     IS: '80'
    (0020, 0032) Image Position (Patient)            DS: ['-166.000000', '-171.699997', '-207.500000']
    (0020, 0037) Image Orientation (Patient)         DS: ['1.000000', '0.000000', '0.000000', '0.000000', '1.000000', '0.000000']
    (0020, 0052) Frame of Reference UID              UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.229925374658226729607867499499
    (0020, 1040) Position Reference Indicator        LO: 'SN'
    (0020, 1041) Slice Location                      DS: '-207.500000'
    (0028, 0002) Samples per Pixel                   US: 1
    (0028, 0004) Photometric Interpretation          CS: 'MONOCHROME2'
    (0028, 0010) Rows                                US: 512
    (0028, 0011) Columns                             US: 512
    (0028, 0030) Pixel Spacing                       DS: ['0.703125', '0.703125']
    (0028, 0100) Bits Allocated                      US: 16
    (0028, 0101) Bits Stored                         US: 16
    (0028, 0102) High Bit                            US: 15
    (0028, 0103) Pixel Representation                US: 1
    (0028, 0120) Pixel Padding Value                 US: 63536
    (0028, 0303) Longitudinal Temporal Information M CS: 'MODIFIED'
    (0028, 1050) Window Center                       DS: '-600'
    (0028, 1051) Window Width                        DS: '1600'
    (0028, 1052) Rescale Intercept                   DS: '-1024'
    (0028, 1053) Rescale Slope                       DS: '1'
    (0038, 0020) Admitting Date                      DA: '20000101'
    (0040, 0002) Scheduled Procedure Step Start Date DA: '20000101'
    (0040, 0004) Scheduled Procedure Step End Date   DA: '20000101'
    (0040, 0244) Performed Procedure Step Start Date DA: '20000101'
    (0040, 2016) Placer Order Number / Imaging Servi LO: ''
    (0040, 2017) Filler Order Number / Imaging Servi LO: ''
    (0040, a075) Verifying Observer Name             PN: 'Removed by CTP'
    (0040, a123) Person Name                         PN: 'Removed by CTP'
    (0040, a124) UID                                 UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.335419887712224178340067932923
    (0070, 0084) Content Creator's Name              PN: ''
    (0088, 0140) Storage Media File-set UID          UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.211790042620307056609660772296
    (7fe0, 0010) Pixel Data                          OW: Array of 524288 bytes
    

    e.g., 000001.dcm in LIDC-IDRI-0069 (TOSHIBA) reads as follows:

    (0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'AXIAL']
    (0008, 0016) SOP Class UID                       UI: CT Image Storage
    (0008, 0018) SOP Instance UID                    UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.263800607656124864093833884216
    (0008, 0020) Study Date                          DA: '20000101'
    (0008, 0021) Series Date                         DA: '20000101'
    (0008, 0022) Acquisition Date                    DA: '20000101'
    (0008, 0023) Content Date                        DA: '20000101'
    (0008, 0024) Overlay Date                        DA: '20000101'
    (0008, 0025) Curve Date                          DA: '20000101'
    (0008, 002a) Acquisition DateTime                DT: '20000101'
    (0008, 0030) Study Time                          TM: ''
    (0008, 0032) Acquisition Time                    TM: '185549.500'
    (0008, 0033) Content Time                        TM: '185605.277'
    (0008, 0050) Accession Number                    SH: '2819497684894126'
    (0008, 0060) Modality                            CS: 'CT'
    (0008, 0070) Manufacturer                        LO: 'TOSHIBA'
    (0008, 0090) Referring Physician Name            PN: ''
    (0008, 1090) Manufacturer Model Name             LO: 'Aquilion'
    (0010, 0010) Patient Name                        PN: ''
    (0010, 0020) Patient ID                          LO: 'LIDC-IDRI-0069'
    (0010, 0030) Patient Birth Date                  DA: ''
    (0010, 0040) Patient Sex                         CS: 'M'
    (0010, 1010) Patient Age                         AS: '051Y'
    (0010, 2160) Ethnic Group                        SH: 'white-ns'
    (0010, 21c0) Pregnancy Status                    US: 4
    (0010, 21d0) Last Menstrual Date                 DA: '20000101'
    (0012, 0062) Patient Identity Removed            CS: 'YES'
    (0012, 0063) De-identification Method            LO: 'DCM:113100/113105/113107/113108/113109/113111'
    (0013, 0010) Private Creator                     OB: 'CTP '
    (0013, 1010) Private tag data                    OB: 'LIDC-IDRI '
    (0013, 1013) Private tag data                    OB: '62796001'
    (0018, 0010) Contrast/Bolus Agent                LO: '100ccs_OMNI-350'
    (0018, 0015) Body Part Examined                  CS: 'CHEST'
    (0018, 0022) Scan Options                        CS: 'HELICAL_CT'
    (0018, 0050) Slice Thickness                     DS: '2.0'
    (0018, 0060) KVP                                 DS: '135'
    (0018, 0090) Data Collection Diameter            DS: '400.00'
    (0018, 1020) Software Version(s)                 LO: 'V2.04ER001'
    (0018, 1100) Reconstruction Diameter             DS: '379.687'
    (0018, 1120) Gantry/Detector Tilt                DS: '+0.0'
    (0018, 1130) Table Height                        DS: '+48.00'
    (0018, 1140) Rotation Direction                  CS: 'CW'
    (0018, 1150) Exposure Time                       IS: '500'
    (0018, 1151) X-Ray Tube Current                  IS: '260'
    (0018, 1152) Exposure                            IS: '130'
    (0018, 1210) Convolution Kernel                  SH: 'FC10'
    (0018, 5100) Patient Position                    CS: 'FFS'
    (0020, 000d) Study Instance UID                  UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.303241414168367763244410429787
    (0020, 000e) Series Instance UID                 UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.131939324905446238286154504249
    (0020, 0010) Study ID                            SH: ''
    (0020, 0011) Series Number                       IS: '3079'
    (0020, 0012) Acquisition Number                  IS: '5'
    (0020, 0013) Instance Number                     IS: '134'
    (0020, 0020) Patient Orientation                 CS: ['L', 'P']
    (0020, 0032) Image Position (Patient)            DS: ['-184.375000', '-188.281200', '1292.500000']
    (0020, 0037) Image Orientation (Patient)         DS: ['1.000000', '0.000000', '0.000000', '0.000000', '1.000000', '0.000000']
    (0020, 0052) Frame of Reference UID              UI: 1.3.6.1.4.1.14519.5.2.1.6279.6001.228313061349684266844487315959
    (0020, 1040) Position Reference Indicator        LO: ''
    (0020, 1041) Slice Location                      DS: '+324.00'
    (0028, 0002) Samples per Pixel                   US: 1
    (0028, 0004) Photometric Interpretation          CS: 'MONOCHROME2'
    (0028, 0010) Rows                                US: 512
    (0028, 0011) Columns                             US: 512
    (0028, 0030) Pixel Spacing                       DS: ['0.741', '0.741']
    (0028, 0100) Bits Allocated                      US: 16
    (0028, 0101) Bits Stored                         US: 16
    (0028, 0102) High Bit                            US: 15
    (0028, 0103) Pixel Representation                US: 1
    (0028, 0303) Longitudinal Temporal Information M CS: 'MODIFIED'
    (0028, 1050) Window Center                       DS: '-500'
    (0028, 1051) Window Width                        DS: '2000'
    (0028, 1052) Rescale Intercept                   DS: '0'
    (0028, 1053) Rescale Slope                       DS: '1'
    (0032, 000a) Study Status ID                     CS: ''
    (0032, 1000) Scheduled Study Start Date          DA: ''
    (0032, 1001) Scheduled Study Start Time          TM: ''
    (0032, 1060) Requested Procedure Description     LO: ''
    (0032, 1064)  Requested Procedure Code Sequence   1 item(s) ---- 
       (0008, 0104) Code Meaning                        LO: ''
       ---------
    (0038, 0020) Admitting Date                      DA: '20000101'
    (0040, 0002) Scheduled Procedure Step Start Date DA: '20000101'
    (0040, 0003) Scheduled Procedure Step Start Time TM: ''
    (0040, 0004) Scheduled Procedure Step End Date   DA: '20000101'
    (0040, 0005) Scheduled Procedure Step End Time   TM: ''
    (0040, 0244) Performed Procedure Step Start Date DA: '20000101'
    (0040, 0245) Performed Procedure Step Start Time TM: ''
    (0040, 2016) Placer Order Number / Imaging Servi LO: ''
    (0040, 2017) Filler Order Number / Imaging Servi LO: ''
    (0040, a075) Verifying Observer Name             PN: 'Removed by CTP'
    (0040, a123) Person Name                         PN: 'Removed by CTP'
    (0070, 0084) Content Creator Name                PN: ''
    (7fe0, 0010) Pixel Data                          OB or OW: Array of 524288 bytes
    

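    As the two dumps show, scans from different manufacturers store slightly different sets of fields, but the main ones are always present:

    • SOP Instance UID uniquely identifies each dcm slice. Study Instance UID and Series Instance UID, mentioned above, identify the study and a series within that study, respectively;
    • Modality is the imaging modality, such as CT, MR (MRI), CR, or DX (DR);
    • Manufacturer is the scanner manufacturer. Analysis shows that four manufacturers contributed data: "GE MEDICAL SYSTEMS" (the most), "SIEMENS", "TOSHIBA", and "Philips". See /baina/sda1/data/lidc_matrix/information.txt;
    • Slice Thickness is the slice thickness along the z direction, in millimeters. The observed values are GE MEDICAL SYSTEMS: 2.50, 1.25; SIEMENS: 0.75, 1.0, 2.0, 3.0, 5.0; TOSHIBA: 2.0, 3.0; Philips: 2.0, 1.0, 1.5, 0.9;
    • Instance Number is the index of a slice within the series and can be used directly to order the slices. In an actual CT scan, acquisition starts at the head end of the chest and proceeds to the bottom of the lungs, so Instance Number increases while the z component of Image Position decreases. Slice Location is a relative position. In the vast majority of cases it equals the z value of Image Position and decreases in the same way, but for some manufacturers, such as TOSHIBA, it can differ from z: being a relative position, it is positive and changes in the same direction as Instance Number. To avoid errors in analysis, do not sort slices by Slice Location alone; use Instance Number or the z component of Image Position instead. This experiment uses Instance Number;
    • Image Position gives the x, y, z coordinates of the upper-left corner of the image in the spatial coordinate system, in millimeters. When it is given for a study, it refers to the upper-left corner of the first image in the series;
    • Slice Location is the relative position of the slice along the z axis, in millimeters. It usually equals the z component of Image Position, but not in the TOSHIBA data, so it cannot be used on its own to sort all slices consistently;
    • Photometric Interpretation specifies how the pixel data should be interpreted. For CT images it takes one of two enumerated values, MONOCHROME1 or MONOCHROME2. It indicates whether the image is in color: MONOCHROME1/2 are grayscale, RGB is true color, and other values exist;
    • Pixel Spacing is the physical distance between the centers of adjacent pixels, in millimeters;
    • Bits Allocated is the number of bits allocated to store each pixel, and Bits Stored is the number of bits actually used by each pixel;
    • Pixel Representation is the data representation of the pixel values. It is an enumerated value, either 0000H (unsigned integer) or 0001H (two's complement).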
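    The sorting advice in the list above can be sketched as follows. The example works on plain header records rather than real files; in practice the three values would come from a DICOM reader (for instance the InstanceNumber, ImagePositionPatient and SliceLocation attributes), which is not shown. `SliceHeader`, `sort_slices` and `slice_location_matches_z` are hypothetical names, and the head-to-foot ordering rule is the one observed in this article, not a guarantee of the DICOM standard.

```python
from dataclasses import dataclass


@dataclass
class SliceHeader:
    instance_number: int      # (0020,0013) Instance Number
    image_position_z: float   # z component of (0020,0032) Image Position
    slice_location: float     # (0020,1041) Slice Location


def sort_slices(headers, key="instance_number"):
    """Order slices head to foot without relying on Slice Location."""
    if key not in ("instance_number", "image_position_z"):
        raise ValueError("sort by instance_number or image_position_z only")
    # Per the observation above: Instance Number rises while z falls.
    descending = key == "image_position_z"
    return sorted(headers, key=lambda h: getattr(h, key), reverse=descending)


def slice_location_matches_z(headers, tol=1e-3):
    """True if Slice Location equals Image Position z on every slice."""
    return all(abs(h.slice_location - h.image_position_z) <= tol
               for h in headers)
```

    Comparing the order produced by the two keys, and checking `slice_location_matches_z`, is a cheap way to spot series (such as the TOSHIBA example above, with z = 1292.5 but Slice Location = +324.00) where Slice Location would give a misleading order.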

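    **Key XML Fields**

    Analyzing the XML annotations of all 1012 patients reveals the following problem:

    The radiologists' annotations may contain errors (in my opinion)!

    After running the annotation script (/home/zhwhong/API/get_txt.sh) on all cases, the generated log (/baina/sda1/data/lidc_matrix/get_txt.log) shows four problematic cases: LIDC-IDRI-0017, LIDC-IDRI-0365, LIDC-IDRI-0566, and LIDC-IDRI-0659.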
    【LIDC-IDRI-0017】

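    We take the sop_uid that does not exist, "1.3.6.1.4.1.14519.5.2.1.6279.6001.305973183883758685859912046949", then open the XML file for case 0017 and look at the radiologists' annotations:
    Two annotations carry this sop_uid, one from radiologist 2 and one from radiologist 4. Their annotations are:
    Radiologist 2:

    Radiologist 4:

    So two radiologists marked this sop_uid, with ImageZposition -82.75. We then check in the XML file whether the other two radiologists also annotated ImageZposition -82.75. They did, but the sop_uid they recorded for the slice at -82.75 differs from the one used by radiologists 2 and 4:
    Radiologist 1:

    Radiologist 3:

    This is awkward: the same ImageZposition, yet different sop_uid values. To get to the bottom of it, I wrote a script that walks through every dcm slice in LIDC-IDRI-0017 and prints its sop_uid for comparison. The sop_uid marked by radiologists 2 and 4 does not appear anywhere in the results, whereas the one marked by radiologists 1 and 3 does:
    The sop_uid marked by radiologists 2 and 4 is not found:

    The sop_uid marked by radiologists 1 and 3 is found:

    The preliminary conclusion is that in case LIDC-IDRI-0017 the annotations by radiologists 2 and 4 contain two errors (wrong sop_uid).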
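    The check described for LIDC-IDRI-0017 can be approximated as below. This is a sketch, not the author's script: it assumes the annotation XML stores each contour in a `roi` element with `imageSOP_UID` and `imageZposition` children (the article refers to them only as sop_uid and ImageZposition), and it takes the set of SOP Instance UIDs found in the dcm files as an input that would be collected separately. The function names are illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{http://www.nih.gov}roi' -> 'roi'."""
    return tag.rsplit("}", 1)[-1]


def annotated_sop_uids(xml_text):
    """Collect (sop_uid, z_position) pairs from every roi element."""
    pairs = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "roi":
            continue
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in elem}
        if "imageSOP_UID" in fields:  # assumed tag name
            pairs.append((fields["imageSOP_UID"],
                          fields.get("imageZposition")))
    return pairs


def missing_sop_uids(xml_text, dicom_sop_uids):
    """Annotated (sop_uid, z) pairs whose sop_uid matches no dcm slice."""
    known = set(dicom_sop_uids)
    missing = {(uid, z) for uid, z in annotated_sop_uids(xml_text)
               if uid not in known}
    return sorted(missing, key=lambda pair: (pair[0], pair[1] or ""))
```

    Reporting the z position together with each missing sop_uid makes it easy to look for another radiologist's annotation at the same ImageZposition, which is how the mismatch in case 0017 was confirmed above.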
    【LIDC-IDRI-0365】
    LIDC-IDRI-0365 contains two series (Study Instance UID / Series Instance UID):
    1.3.6.1.4.1.14519.5.2.1.6279.6001.212341120080087350703610584139 / 1.3.6.1.4.1.14519.5.2.1.6279.6001.207544473852086582434957174616

    1.3.6.1.4.1.14519.5.2.1.6279.6001.216207548522622026268886920069 / 1.3.6.1.4.1.14519.5.2.1.6279.6001.802846969823720586279982179144
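    The problem lies in the second series and is similar to that of case 0017:

    The radiologists' annotations are as follows (all four radiologists marked the same sop_uid):

    Walking through the second series of LIDC-IDRI-0365 in the same way, no slice with the annotated sop_uid can be found: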

    【LIDC-IDRI-0566】
    The same problem as above occurs:

    【LIDC-IDRI-0659】



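      Reader Comments

      • af4eb72f079f: I'm back to reread your article. d(゚∀゚d) Thumbs up!
      • AutumnsFall: Hello, how can the nodule boundary information in the XML files be used for model training? Are there any related papers or code?
        朝霞_c0ab: I'd like to ask you a few questions. Could we connect on WeChat? zhaoxia9593
      • 3cd5123b60a1: Hi, where can I see the SOP UID? The xml in a folder I downloaded does not seem to match the dcm files in the same folder. Thanks in advance.
      • 虚拟现实_959f: https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI does not seem to open; I cannot access it and do not know why.
        zhwhong: It is probably your network. Normally the page opens fine.
      • 4e031e98c32a: "After running the annotation script (/home/zhwhong/API/get_txt.sh) on all cases, the generated log (/baina/sda1/data/lidc_matrix/get_txt.log) shows four problematic cases: LIDC-IDRI-0017, LIDC-IDRI-0365, LIDC-IDRI-0566, and LIDC-IDRI-0659.
        【LIDC-IDRI-0017】"
        How do I run the annotation script and generate the log?
        zhwhong: When you run the script, something like sh get_txt.sh > get_txt.log is enough; it redirects the output from the screen to a file.
        zhwhong: Well, the log is just the output of the program. There was too much of it, so I redirected it to a .log file. You can look at https://github.com/zhwhong/lidc_nodule_detection/blob/master/api_lidc/get_txt.sh ; the generated log looks roughly like https://github.com/zhwhong/lidc_nodule_detection/blob/master/api_lidc/nohup2.log
      • 41eebf7b16f6: Hello, where can this database be downloaded? I cannot find the download link.
        zhwhong: https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI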

      Article link: https://www.haomeiwen.com/subject/rknvmttx.html