Splitting strings with boost::split_iterator

Author: FredricZhu | Published 2021-04-17 13:42


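The code is very simple: given a set of delimiter characters, it returns the pieces of the string that lie between them.
In C++, what you actually get back is an object of type
boost::split_iterator<const char*>.
In other words, you get an iterator.
Dereferencing it yields an iterator_range<const char*>, which holds the begin and end iterators of the current substring.
That sounds convoluted, but the code makes it clear.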
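Before the real code, here is a minimal sketch of the idea of a "range made of two iterators", written with the standard library only. CharRange and make_view are hypothetical names invented for this illustration; they are a stand-in for boost::iterator_range<const char*>, not its actual definition.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for boost::iterator_range<const char*>:
// a non-owning view described by two pointers, [first, last).
struct CharRange {
    const char* first;
    const char* last;

    const char* begin() const { return first; }
    const char* end() const { return last; }
    std::size_t size() const { return static_cast<std::size_t>(last - first); }

    // Copy the viewed characters into an owning string when one is needed.
    std::string str() const { return std::string(first, last); }
};

// Build a view of text[offset, offset + count) without copying any characters.
inline CharRange make_view(const char* text, std::size_t offset, std::size_t count) {
    assert(text != nullptr);
    return CharRange{text + offset, text + offset + count};
}
```

The view never owns the characters, so it is only valid while the original buffer is alive. That is why the example in this article can split a const char array without allocating anything.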
CMakeLists.txt

cmake_minimum_required(VERSION 3.5)
project(lexical_cast)

# Request C++14 through CMake instead of passing a raw compiler flag
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include_directories("/usr/local/include")
link_directories("/usr/local/lib")
# Build one executable per .cpp file in this directory
file(GLOB APP_SOURCES ${CMAKE_CURRENT_SOURCE_DIR}/*.cpp)
foreach(sourcefile ${APP_SOURCES})
    file(RELATIVE_PATH filename ${CMAKE_CURRENT_SOURCE_DIR} ${sourcefile})
    string(REPLACE ".cpp" "" target_name ${filename})
    add_executable(${target_name} ${sourcefile})
    target_link_libraries(${target_name} boost_filesystem boost_thread boost_system boost_serialization pthread boost_chrono)
endforeach()

main.cpp

#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>

#include <algorithm>
#include <iostream>

int main(int argc, char* argv[]) {
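    // Adjacent literals are concatenated, so each continuation line
    // carries its own separating space.
    const char str[] = "This is a long long character array."
        " Please split this character array to sentences!"
        " Do you know, that sentences are separated using period, "
        "exclamation mark and question mark? :-";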
    
    using split_iter_t = boost::split_iterator<const char*>;
    split_iter_t sentences = boost::make_split_iterator(str, 
        boost::algorithm::token_finder(boost::is_any_of("?!.")));
    
    for(unsigned int i=1; !sentences.eof(); ++sentences, ++i) {
        auto range = *sentences;
        std::cout << "Sentence #" << i << ": \t" << range << '\n';
        std::cout << "Sentence has " << range.size() << " characters.\n";
        std::cout << "Sentence has "
            << std::count(range.begin(), range.end(), ' ')
            << " whitespaces. \n\n";
    }

    return 0;
}

The program produces the following output:

[Figure: screenshot of the program output]

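Let's walk through the source code.

First, the class diagram:

[Figure: class diagram]

As the diagram shows, the implementation is built on template inheritance.
Let's focus on a few of the methods that appear in the diagram.
iterator_facade::operator++() is what gives split_iterator its forward pre-increment: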

      Derived& operator++()
      {
          // Derived here is in fact split_iterator<const char*>
          iterator_core_access::increment(this->derived());
          return this->derived();
      }

      // iterator_core_access::increment
      template <class Facade>
      static void increment(Facade& f)
      {
          // Simply calls the increment() method of the split_iterator class
          f.increment();
      }
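The forwarding shown above is the CRTP (curiously recurring template pattern). The following is a simplified sketch of that pattern, not Boost's code: FacadeSketch, Counter, and advance_by are hypothetical names, and the sketch calls a public increment() directly, whereas Boost routes the call through iterator_core_access.

```cpp
#include <cassert>

// Simplified model of a facade base class; the name is illustrative only.
template <class Derived>
class FacadeSketch {
public:
    // Pre-increment is written once here and forwarded to the derived class.
    Derived& operator++() {
        derived().increment();
        return derived();
    }

private:
    // Safe because Derived inherits from FacadeSketch<Derived>.
    Derived& derived() { return *static_cast<Derived*>(this); }
};

// A derived "iterator" only has to supply increment().
class Counter : public FacadeSketch<Counter> {
public:
    explicit Counter(int start) : value_(start) {}
    int value() const { return value_; }
    void increment() { ++value_; }

private:
    int value_;
};

// Apply ++ a given number of times and report where the counter ends up.
inline int advance_by(int start, int steps) {
    assert(steps >= 0);
    Counter c(start);
    for (int i = 0; i < steps; ++i) {
        ++c;
    }
    return c.value();
}
```

The base class supplies the operator once, and each derived class only implements the small increment() primitive. No virtual call is involved, because the cast to Derived is resolved at compile time.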

Next, look at the split_iterator::increment method:

            // increment
            void increment()
            {
                // Call do_find() of the find_iterator_base base class.
                // It returns an iterator_range describing what was found.
                // For example, when searching for "!" in "abcdefg!hijk",
                // it returns an iterator to '!' and an iterator to 'h';
                // the two iterators together form one iterator_range.
                match_type FindMatch=this->do_find( m_Next, m_End );

                // Nothing was found this time
                if(FindMatch.begin()==m_End && FindMatch.end()==m_End)
                {
                    // If the current match already ends at the end of the string,
                    // set the eof flag to true:
                    // there is nothing left to search
                    if(m_Match.end()==m_End)
                    {
                        // Mark iterator as eof
                        m_bEof=true;
                    }
                }

                // Store the current token in m_Match.
                // m_Match is also an iterator_range<const char*>;
                // in the example above it covers the characters abcdefg
                m_Match=match_type( m_Next, FindMatch.begin() );
                // The next search starts at the character 'h'
                m_Next=FindMatch.end();
            }
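The bookkeeping in increment() can be modelled with the standard library alone. The sketch below is a simplified, hypothetical model: SplitSketch and split_all are names made up for this illustration, std::find_first_of stands in for the finder, and calling increment() once in the constructor (so the first token is ready immediately) is an assumption of the sketch. Empty input is not treated specially here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Simplified model of the increment() logic; not Boost's code.
class SplitSketch {
public:
    SplitSketch(const char* begin, const char* end, const char* delims)
        : match_begin_(begin), match_end_(begin), next_(begin), end_(end),
          delims_(delims), eof_(false) {
        assert(begin != nullptr && end != nullptr && delims != nullptr);
        increment();  // position the sketch on the first token
    }

    bool eof() const { return eof_; }
    std::string token() const { return std::string(match_begin_, match_end_); }

    void increment() {
        // Find the next single delimiter character in [next_, end_).
        const char* delims_end = delims_ + std::strlen(delims_);
        const char* found = std::find_first_of(next_, end_, delims_, delims_end);
        const char* found_end = (found == end_) ? end_ : found + 1;

        // No delimiter left and the previous token already reached the end.
        if (found == end_ && found_end == end_ && match_end_ == end_) {
            eof_ = true;
        }
        match_begin_ = next_;   // the token starts where the last search stopped
        match_end_ = found;     // and ends just before the delimiter
        next_ = found_end;      // the next search starts after the delimiter
    }

private:
    const char* match_begin_;
    const char* match_end_;
    const char* next_;
    const char* end_;
    const char* delims_;
    bool eof_;
};

// Collect every token of `text` as an owning string.
inline std::vector<std::string> split_all(const std::string& text, const char* delims) {
    std::vector<std::string> tokens;
    SplitSketch it(text.data(), text.data() + text.size(), delims);
    for (; !it.eof(); it.increment()) {
        tokens.push_back(it.token());
    }
    return tokens;
}
```

With this model, split_all("abcdefg!hijk", "!") yields the two tokens abcdefg and hijk. A delimiter at the very end of the input produces one final empty token, because the eof flag is only set on the step after the last token has been emitted.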

Next, look at detail::find_iterator_base::do_find(iterator begin, iterator end) const:

                // Find operation
                match_type do_find( 
                    input_iterator_type Begin,
                    input_iterator_type End ) const
                {  
                    // If a finder has been set,
                    if (!m_Finder.empty())
                    {
                        // run the finder over the iterator pair
                        return m_Finder(Begin,End);
                    }
                    else
                    {
                        // otherwise report "not found" right away
                        return match_type(End,End);
                    }
                }
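The empty() check suggests that the finder is held in a nullable, type-erased callable. A minimal sketch of that arrangement follows, assuming std::function as the holder; FinderHolder, Match, Finder, and find_space are hypothetical names for this illustration and are not part of Boost.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>

// A match is described by two pointers, [first, second).
using Match = std::pair<const char*, const char*>;
using Finder = std::function<Match(const char*, const char*)>;

// Hypothetical holder that mirrors the shape of do_find().
class FinderHolder {
public:
    FinderHolder() = default;
    explicit FinderHolder(Finder f) : finder_(std::move(f)) {}

    Match do_find(const char* begin, const char* end) const {
        assert(begin <= end);
        if (finder_) {
            return finder_(begin, end);  // delegate to the stored finder
        }
        return Match(end, end);          // no finder set: report "not found"
    }

private:
    Finder finder_;
};

// Example finder: locate the first space character.
inline Match find_space(const char* begin, const char* end) {
    const char* it = std::find(begin, end, ' ');
    if (it == end) {
        return Match(end, end);
    }
    return Match(it, it + 1);
}
```

Returning (end, end) for a missing finder uses the same "not found" signal that a real finder returns, so the caller needs only one code path.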

The one remaining piece is the finder function, which is fairly simple.
Let's look at the boost::algorithm::token_finder function:

        template< typename PredicateT >
        inline detail::token_finderF<PredicateT>
        token_finder( 
            PredicateT Pred, 
            token_compress_mode_type eCompress=token_compress_off )
        {
            // So this function just returns a functor, which keeps things simple:
            // we only need to read the functor's operator() overload
            return detail::token_finderF<PredicateT>( Pred, eCompress );
        }

                // Operation
                template< typename ForwardIteratorT >
                iterator_range<ForwardIteratorT>
                operator()(
                    ForwardIteratorT Begin,
                    ForwardIteratorT End ) const
                {
                    // Alias for the return type
                    typedef iterator_range<ForwardIteratorT> result_type;

                    // Call std::find_if with the given predicate
                    ForwardIteratorT It=std::find_if( Begin, End, m_Pred );
                    // Not found: return (End, End)
                    if( It==End )
                    {
                        return result_type( End, End );
                    }
                    else
                    { 
                        // Found: remember where the match starts
                        ForwardIteratorT It2=It;

                        if( m_eCompress==token_compress_on )
                        {
                            // Find first non-matching character
                            while( It2!=End && m_Pred(*It2) ) ++It2;
                        }
                        else
                        {
                            // Advance by one position:
                            // start position + 1 becomes the end position
                            ++It2;
                        }
                        // Return the start and end positions as an iterator_range
                        return result_type( It, It2 );
                    }
                }
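To see what the compress mode changes, here is a small standard-library sketch of the same search. find_token, Match, and is_comma are hypothetical names for this illustration, and a plain bool parameter stands in for token_compress_mode_type.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// A match is described by two pointers, [first, second).
using Match = std::pair<const char*, const char*>;

// Sketch of the token search: locate the first character accepted by `pred`.
// With compress == true the match grows over the whole run of accepted
// characters; otherwise it covers exactly one character.
template <class Pred>
Match find_token(const char* begin, const char* end, Pred pred, bool compress) {
    assert(begin <= end);
    const char* it = std::find_if(begin, end, pred);
    if (it == end) {
        return Match(end, end);  // nothing matched
    }
    const char* it2 = it;
    if (compress) {
        while (it2 != end && pred(*it2)) {
            ++it2;               // swallow adjacent delimiters
        }
    } else {
        ++it2;                   // a single delimiter character
    }
    return Match(it, it2);
}

// Example predicate for the sketch.
inline bool is_comma(char c) { return c == ','; }
```

For the input "a,,b", the non-compressing search reports a one-character match at the first comma, so a splitter built on it sees an empty token between the two commas. The compressing search reports both commas as one match, so that empty token disappears.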

Finally, let's look at the boost::is_any_of predicate:

  template<typename RangeT>
        inline detail::is_any_ofF<
            BOOST_STRING_TYPENAME range_value<RangeT>::type> 
        is_any_of( const RangeT& Set )
        {
            iterator_range<BOOST_STRING_TYPENAME range_const_iterator<RangeT>::type> lit_set(boost::as_literal(Set));
            return detail::is_any_ofF<BOOST_STRING_TYPENAME range_value<RangeT>::type>(lit_set); 
        }

 template< class Range >
    inline iterator_range<BOOST_DEDUCED_TYPENAME range_iterator<Range>::type>
    as_literal( Range& r )
    {
        return range_detail::make_range( r, range_detail::is_char_ptr(r) );
    }

        template< class T >
        inline iterator_range<T*>
        make_range( T* const r, bool )
        {
            // In effect, returns the range from the start of the string to its end
            return iterator_range<T*>( r, r + length(r) );
        }
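The predicate built from that range only has to answer "is this character in the set?". A minimal sketch follows, assuming a plain linear scan over a copied set; AnyOfSketch and count_delims are hypothetical names, and the real is_any_ofF may store and search its set differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the functor returned by boost::is_any_of.
class AnyOfSketch {
public:
    // Copy the characters of the C string, excluding the terminating null.
    explicit AnyOfSketch(const char* set) : set_(set, set + std::strlen(set)) {}

    // True when `c` is one of the stored characters.
    bool operator()(char c) const {
        return std::find(set_.begin(), set_.end(), c) != set_.end();
    }

private:
    std::string set_;
};

// Count how many characters of `text` belong to the delimiter set.
inline std::size_t count_delims(const std::string& text, const char* set) {
    assert(set != nullptr);
    return static_cast<std::size_t>(
        std::count_if(text.begin(), text.end(), AnyOfSketch(set)));
}
```

Because the predicate is an ordinary callable taking one character, it can be handed directly to std::find_if, which is how the token finder above consumes it.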

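In summary, boost::split_iterator uses std::find_if to locate delimiter characters within an iterator range, then repeatedly advances the begin and end iterators to carve out each substring in turn.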
