如何处理Excel表格中的删除线

作者: welder77 | 来源:发表于2023-01-07 21:20 被阅读0次

轻松应对80%财务工作需求，财务EXCEL表格免费学
Excel小技巧：表格中带单位的数值如何计算？
Word、Excel技巧
【Excel】如何删除表格中的重复项
Excel表格如何导入CAD软件中？
Excel技巧：Excel如何绘制边框？
EXcel与MySQL之间的数据迁移
Python处理excel表格
Excel快速删除表格内容的两个小技巧。
Excel小技巧：如何固定表格首列？

今天，还是来讲下工作中遇到的一个Excel处理的小问题。

系统工程师，在使用Excel编写需求的时候，还是非常喜欢使用删除线的，如下图。
使用删除线的好处不言而喻，可以看出需求变更的前后过程。但这样添加了删除线后对于自动化程序来说则是一个难题，今天我们来介绍下应对方法。

strike.png

我们首先对Excel的删除线做一个简单的分类：
第一类：上图A列或B列这样, 整个单元格的字体都设置为删除线。
第二类：上图C列这样,单元格中有部分内容是带删除线的，又有部分是正常的字体。

默认情况下，主流的Python Excel处理库在读取以上两类带删除线的表格时，都只能读到表格中实际的内容，如B列的“test2”和C列的“old current new”，根本不会因为删除线的存在而有所区别。
这样的处理会导致我们在解析表格内容的时候无从下手，特别是“old current new”这样的内容，我们根本不知道，实际编写者的意图其实是current。

下面我们就针对以上两类删除线，提供解决方案。

方案1 - xml文件修改 + openpyxl方案：
这个方案来源于CSDN的天地之心online，简单解释下这个方案，excel的表格本质上其实是一系列xml文件，xml中定义了表格的内容以及格式。所以删除线自然也在格式相关的xml中有所定义，具体在哪里定义呢？答案是在xl目录中的sharedStrings.xml文件中。
类似于C列的表格中，部分是常规内容，部分带删除线的内容，带删除线的格式定义就会在此文件中定义，它们以<strike/>开头。所以我们可以利用正则表达式去掉后面的实际内容，从而实现了去除删除线所属的内容。

image.png

以下是xml文件修改部分的代码：

    def excel_strike_content_delete(path, final_file):
        extract_path = os.path.join(os.path.split(path)[0], "extract_path")
        if os.path.exists(extract_path):
            shutil.rmtree(extract_path)
            os.mkdir(extract_path)
        else:
            os.mkdir(extract_path)
        f = zipfile.ZipFile(path)
        for file in f.namelist():
            f.extract(file, extract_path)
        f.close()
        xml = extract_path + '/xl/sharedStrings.xml'
        with open(xml, 'rt', encoding="utf-8") as f:
            content = f.readlines()
            content = "".join([line for line in content])

        def replace(matched):
            group0 = matched.group(0)
            group1 = matched.group(1)
            group2 = matched.group(2)
            # print(group0)
            res = group0.replace(group2, "<t></t>").replace(group1, "")
            # print("res: ", res)
            return res

        pattern = r"<rPr>(<strike/>).*?</rPr>(<t>(?:.*?)</t>)"
        # print(type(content))
        line = re.sub("</si><si>", "</si>myLabel<si>", content, flags=re.S)
        # print(type(line))
        ls = line.split("myLabel")
        for i, line in enumerate(ls):
            # print("  ", line)
            line = re.sub(pattern, replace, line, flags=re.S)
            ls[i] = line
            # print(line)

        with open(xml, 'w+', encoding='utf-8') as f:
            f.write("".join(ls))

        if path == final_file:
            move_path = os.path.join(os.path.split(path)[0], "xlsx_strike_backup")
            if os.path.exists(move_path):
                shutil.rmtree(move_path)
                os.mkdir(move_path)
            else:
                os.mkdir(move_path)
            new_move_file =os.path.join(move_path, path)
            try:
                shutil.move(path,new_move_file)
            except:
                tmp = str(path).rsplit(".",1)
                xlsx_file_copy = f"{tmp[0]}_copy.{tmp[1]}"
                final_file = xlsx_file_copy

        f = zipfile.ZipFile(final_file, mode='w')
        for path, dirnames, filenames in os.walk(extract_path):
            # 去掉目标跟路径，只对目标文件夹下边的文件及文件夹进行压缩
            fpath = path.replace(extract_path, '')
            for filename in filenames:
                f.write(os.path.join(path, filename), os.path.join(fpath, filename))
        f.close()
        return os.path.abspath(final_file)

通过以上xml修改后，我们已经可以顺利把C列表格中的删除线去掉。但A列或B列这样整个单元格都带删除线的内容还是无法清除。我们还需要利用openpyxl读取单元格的字体属性是否包含strike来过滤，下面是对应的代码。

import openpyxl

book = openpyxl.load_workbook("test.xlsx")
sheet = book.worksheets[0]

for row in sheet.rows:
    for cell in row:
        if cell.font.strike:
            pass

经过以上两种处理方式之后，两种类型的删除线都会被清除干净。

方案2 - xlwings方案：
这个方案来自于StackOverflow，只需要通过一个库xlwings就可以搞定，xlwings既可以读取单元格中的strike属性，也可以读取到单个字符的strike属性，所以可以一次性把两种删除线都清除掉。
不过初学者使用xlwings还是需要注意一点，因为xlwings是通过调用win32 api对excel文件进行编辑的，所以需要额外注意每次操作完成excel后或程序发生以外时，都要关闭excel文件，否则很有可能使被操作的excel文件处于锁定状态。

import xlwings as xw

wb = xw.Book('foo1.xlsx')
ws = wb.sheets('Sheet1')

for cell in ws.range('A1:B3'):
    cell_coord = str(cell.address.replace('$', ''))
    print('Cell: ' + cell_coord + ', Cell value: ' + str(cell.value), end=', ')

    st = cell.font.impl.xl.Strikethrough
    print('ST value: ' + str(st), end=' ')
    if st:
        print(', Cell has strikethrough font.')
        cell.clear()
    elif st is None:
        print(', Cell has mixed font.')
        num_chars = len(cell.value)
        print(cell_coord + ' Text length: ' + str(num_chars) + ' characters.')
        print("Individual character font strikethrough?")
        char_position = 0
        while True:
            cur_text_value = cell.characters[char_position].text
            if cur_text_value == '':
                break
            else:
                print("'" + cur_text_value + "'", end=' ')
                char_is_st_font = cell.characters[char_position].font.impl.xl.Strikethrough
                print(char_is_st_font)

                if char_is_st_font:
                    cell.characters[char_position].api.Delete()
                else:
                    char_position += 1

    else:
        print(', Cell has NO strikethrough font.')

wb.save('foo2.xlsx')
wb.close()

总结，Python处理Excel的库有很多，操作也不难，但在读写Excel时，类似于删除线，还有像合并单元格，隐藏的行或列，带公式单元格，浮点数，等细节处理，并不能简单通过默认读写功能实现。在网上搜索不同方法，合理组合其才能更好的实现预期。