Regular Expressions

Author: 钊钖 | Published: 2017-12-18 12:41

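    A regular expression is a pattern for filtering strings. It can be used to:

    • check whether a given string matches the pattern
    • extract specific parts of a string

    The notes below work through the Dataquest exercises to cover some common usage.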

    1. Introduction (instructions)

    In the code cell, assign to the variable regex a regular expression that's four characters long and matches every string in the list strings.

    strings = ["data science", "big data", "metadata"]
    regex = 'data'
    

    2. Wildcards in Regular Expressions (instructions)

    In Python, we use the re module to work with regular expressions. The module's documentation provides a list of these special characters.

    For instance, we use the special character "." to indicate that any character can be put in its place.

    Assign a regular expression that is three characters long and matches every string in strings to the variable regex.

    strings = ["bat",'robotics','megabyte']
    regex = "b.t"
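To see what the wildcard actually captures, here is a minimal sketch that reuses the exercise's strings. The `matched` name is only illustrative and is not part of the exercise.

```python
import re

strings = ["bat", "robotics", "megabyte"]
regex = "b.t"

# "." stands for any single character (except a newline by default),
# so the middle letter differs from string to string.
matched = [re.search(regex, s).group() for s in strings]
print(matched)  # ['bat', 'bot', 'byt']
```

Each string contains a different three-character substring, which is why a literal pattern such as "bat" would not be enough here.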
    

    3. Searching the Beginnings and Endings of Strings (instructions)

    We can use the caret symbol ("^") to match the beginning of a string, and the dollar sign ("$") to match the end of a string.

    Assign a regular expression that's seven characters long and matches every string in strings (except for bad_string) to the variable regex.

    strings = ["better not put too much", "butter in the", "batter"]
    bad_string = "We also wouldn't want it to be bitter"
    regex = '^b.tter'
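The effect of the anchors can be checked directly. The sketch below reuses the exercise's strings; the result variable names are illustrative assumptions.

```python
import re

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = "^b.tter"

# Every string in the list begins with b?tter, so each one matches.
good_results = [re.search(regex, s) is not None for s in strings]

# "bitter" appears only at the end of bad_string, so "^" rejects it.
bad_result = re.search(regex, bad_string) is not None

# Anchoring with "$" instead would accept it.
end_result = re.search("b.tter$", bad_string) is not None

print(good_results, bad_result, end_result)  # [True, True, True] False True
```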
    

    4. Introduction to the AskReddit Data Set

    The AskReddit data set has five columns, which appear in the following order:

    Title -- The title of the post
    Score -- The number of upvotes the post received
    Time -- When the post was posted
    Gold -- How much Reddit Gold users gave the post
    NumComs -- The number of comments the post received

    5. Reading and Printing the Data Set (instructions)

    Title | Score | Time | Gold | NumComs
    --- | --- | --- | --- | ---
    What's your internet "white whale", something you've been searching for years to find with no luck? | 11510 | 1433213314 | 1 | 26195
    What's your favorite video that is 10 seconds or less? | 8656 | 1434205517 | 4 | 8479
    What are some interesting tests you can take to find out about yourself? | 8480 | 1443409636 | 1 | 4055
    PhD's of Reddit. What is a dumbed down summary of your thesis? | 7927 | 1440188623 | 0 | 13201
    What is cool to be good at, yet uncool to be REALLY good at? | 7711 | 1440082910 | 0 | 20325

    Let's use the csv module to read and print our data file, "askreddit_2015.csv". Recall that we can use the csv module by performing the following steps:

    1. Import csv.
    2. Open the file that contains our CSV data in 'r' mode.
    3. Call the csv.reader() function with the file object as input.
    4. Convert the result to a list.

    Instructions

    • Use the csv module to read our data set and assign it to posts_with_header.
    • Use list slicing to exclude the first row, which represents the column names. Assign this sliced data set to posts.
    • Use a for loop and list slicing to print the first 10 rows. See if you notice any patterns in this sample of the data set.

    import csv

    # Read every row of the CSV file into a list of lists.
    with open("askreddit_2015.csv", "r") as f:
        posts_with_header = list(csv.reader(f))

    # Skip the header row, then print the first 10 posts.
    posts = posts_with_header[1:]
    for post in posts[:10]:
        print(post)
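Since "askreddit_2015.csv" may not be at hand, the same steps can be tried on an in-memory stand-in. This is a sketch under assumptions: `sample_csv` is a hypothetical two-row excerpt built from the sample table, and a comma delimiter is assumed for the real file.

```python
import csv
import io

# Hypothetical stand-in for askreddit_2015.csv: a header plus two sample rows.
sample_csv = (
    "Title,Score,Time,Gold,NumComs\n"
    "What's your favorite video that is 10 seconds or less?,8656,1434205517,4,8479\n"
    "PhD's of Reddit. What is a dumbed down summary of your thesis?,7927,1440188623,0,13201\n"
)

# csv.reader accepts any iterable of lines, including a StringIO object.
posts_with_header = list(csv.reader(io.StringIO(sample_csv)))
posts = posts_with_header[1:]  # drop the header row

for post in posts[:10]:
    print(post[0], post[1])
```

Note that csv.reader returns every field as a string, so a score such as "8656" needs int() before any arithmetic.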
    

    6. Counting Simple Matches in the Data Set with re

    We mentioned the re module earlier, and now we'll begin to use it in our code. One useful function the module provides is re.search.

    With re.search(regex, string), we can check whether string is a match for regex. If it is, the expression will return a match object. If it isn't, it will return None. For now, we won't worry about returning the actual matches - we'll just compare the result to None to see whether we have a match or not.

    
    import re

    if re.search("needle", "haystack") is not None:
        print("We found it!")
    else:
        print("Not a match")
    

    The code above will print Not a match, because "haystack" is not a match for the regex "needle".
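For comparison, a short sketch of what re.search returns in each case. The variable names `found` and `missing` are illustrative.

```python
import re

found = re.search("hay", "haystack")
missing = re.search("needle", "haystack")

# A successful search returns a match object that records what matched and where.
print(found.group(), found.span())  # hay (0, 3)

# An unsuccessful search returns None, which is why the result is compared to None.
print(missing)  # None
```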

    You may have noticed that many of the posts in our AskReddit database are directed towards particular groups of people, using phrases like "Soldiers of Reddit". These types of posts are common, and always follow a similar format. We can use regular expressions to count how many of them are in the top 1,000. Let's do this in our next exercise. We've already read the data set into the variable posts.

    Instructions

    Count the number of posts in our data set that match the regex "of Reddit". Assign the count to of_reddit_count.

    import re
    of_reddit_count = 0 
    for post in posts:
        if re.search('of Reddit',post[0]) is not None:
            of_reddit_count += 1
    print(of_reddit_count)
    

    7. Using Square Brackets to Match Multiple Characters

    For example, the regex "[bcr]at" would match the substrings "bat", "cat", and "rat", but nothing else. We indicate that the first character in the regex can be either "b", "c" or "r".

    Instructions

    • Use square bracket notation to make the code account for both capitalizations of "Reddit", and count how many posts contain "of Reddit" or "of reddit" in the title.
    • Assign the resulting count to of_reddit_count.
    import re
    of_reddit_count = 0
    for post in posts:
        if re.search('of [rR]eddit', post[0]) is not None:
            of_reddit_count += 1
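A small sketch of how a bracketed set behaves. The `words` and `titles` lists are made up for illustration and are not taken from the data set.

```python
import re

words = ["bat", "cat", "rat", "hat"]

# [bcr] matches exactly one character: "b", "c", or "r".
at_matches = [w for w in words if re.search("[bcr]at", w) is not None]
print(at_matches)  # ['bat', 'cat', 'rat']

# Made-up titles, used only to exercise the pattern from the exercise.
titles = [
    "Soldiers of Reddit, what was it like?",
    "Teachers of reddit, what happened?",
    "What is your favorite book?",
]
of_reddit_count = sum(1 for t in titles if re.search("of [rR]eddit", t) is not None)
print(of_reddit_count)  # 2
```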
    

    8. Escaping Special Characters

    Characters such as "[" and "]" have a special meaning in a regex, so to match them literally we need to escape them with a backslash ("\").

    Instructions

    • Escape the square bracket characters to count the number of posts in our data set that contain the "[Serious]" tag.

    • Assign the count to serious_count.
    import re
    serious_count = 0
    for post in posts:
        # A raw string keeps the backslashes intact for the regex engine.
        if re.search(r'\[Serious\]', post[0]) is not None:
            serious_count += 1
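The sketch below shows why the escaping matters. The sample strings are made up; re.escape is a standard-library helper that adds the backslashes automatically.

```python
import re

# Unescaped, "[Serious]" is a character set: it matches ONE of S, e, r, i, o, u, s.
unescaped = re.search("[Serious]", "no tag here")
print(unescaped.group())  # o

# Escaped, the brackets are literal, so the whole tag must be present.
escaped_miss = re.search(r"\[Serious\]", "no tag here")
escaped_hit = re.search(r"\[Serious\]", "[Serious] What is it like?")
print(escaped_miss, escaped_hit.group())  # None [Serious]

# re.escape builds the escaped pattern from a literal string.
auto_pattern = re.escape("[Serious]")
print(auto_pattern)
```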
    

    9. Combining Escaped Characters and Multiple Matches

    Some people tag serious posts as "[Serious]", and others as "[serious]". We should account for both capitalizations.

    Instructions

    • Refine the code to count how many posts have either "[Serious]" or "[serious]" in the title.
    • Assign the count to serious_count.
    import re
    serious_count = 0
    for post in posts:
        if re.search(r'\[[Ss]erious\]', post[0]) is not None:
            serious_count += 1
    

    10. Adding More Complexity to Your Regular Expression

    In our data set, some users have tagged their posts with "(Serious)" or "(serious)", including the parentheses. Therefore, we should account for both square brackets and parentheses. We can do this by using square bracket notation, and escaping the "[", "]", "(", and ")" characters with the backslash.

    Instructions

    • Refine the code so that it counts how many posts have the serious tag enclosed in either square brackets or parentheses.
    • Assign the count to serious_count.
    import re
    serious_count = 0
    for post in posts:
        # Either bracket type may open the tag, and either may close it.
        if re.search(r'[\[\(][Ss]erious[\]\)]', post[0]) is not None:
            serious_count += 1
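A quick way to check such a pattern is to run it against each variant of the tag. In this sketch the `tags` list is made up for illustration, and it includes a mismatched pair to show one limitation of the approach.

```python
import re

pattern = r"[\[\(][Ss]erious[\]\)]"
tags = ["[Serious]", "[serious]", "(Serious)", "(serious)", "[Serious)", "Serious"]

# Map each candidate tag to whether the pattern finds a match in it.
results = {tag: re.search(pattern, tag) is not None for tag in tags}
for tag, ok in results.items():
    print(tag, ok)
```

The four intended forms match and the bare word does not. The mismatched "[Serious)" also matches, because the opening and closing sets are checked independently of each other.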
    

    11. Combining Multiple Regular Expressions

    To combine regular expressions, we use the "|" character.

    Instructions

    • Use the "^" character to count how many posts include the serious tag at the beginning of the title. Assign this count to serious_start_count.
    • Use the "$" character to count how many posts include the serious tag at the end of the title. Assign this count to serious_end_count.
    • Use the "|" character to count how many posts include the serious tag at either the beginning or end of the title. Assign this count to serious_count_final.
    import re

    serious_start_count = 0
    serious_end_count = 0
    serious_count_final = 0

    for row in posts:
        if re.search(r'^[\[\(][Ss]erious[\]\)]', row[0]) is not None:
            serious_start_count += 1
    for row in posts:
        if re.search(r'[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
            serious_end_count += 1
    for row in posts:
        if re.search(r'^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
            serious_count_final += 1
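The combined pattern is easier to read when it is built from a shared piece. This is a sketch with made-up titles; the names `tag` and `either_end` are illustrative.

```python
import re

tag = r"[\[\(][Ss]erious[\]\)]"

# "|" has the lowest precedence, so "^" applies only to the left alternative
# and "$" only to the right one.
either_end = "^" + tag + "|" + tag + "$"

titles = [
    "[Serious] at the start",
    "at the end (serious)",
    "in the [serious] middle",
]
flags = [re.search(either_end, t) is not None for t in titles]
print(flags)  # [True, True, False]
```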
    

    12. Using Regular Expressions to Substitute Strings

    The re module provides a sub() function that takes the following parameters (in order):

    • pattern: The regex to match
    • repl: The string that should replace the substring matches
    • string: The string containing the pattern we want to search

    Instructions

    • Replace "[serious]", "(Serious)", and "(serious)" with "[Serious]" for all of the titles in posts.
    • You should only need to use one call to sub(), and one regex.
    • Recall that the repl argument is an ordinary string. It's not a regex, so you don't need to escape characters like "[".

    Hint

    "[\[\(][Ss]erious[\]\)]" is the pattern argument to sub(), and "[Serious]" is the repl argument.

    import re
    for row in posts:
        # re.sub returns a new string, so store the result back in the row.
        row[0] = re.sub(r'[\[\(][Ss]erious[\]\)]', '[Serious]', row[0])
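A minimal sketch of the substitution on a few made-up titles. Python strings are immutable, so re.sub never changes its input; the returned string has to be kept.

```python
import re

pattern = r"[\[\(][Ss]erious[\]\)]"
titles = ["[serious] one", "(Serious) two", "three (serious)", "[Serious] four"]

# Collect the returned strings; the originals in `titles` are left untouched.
normalized = [re.sub(pattern, "[Serious]", t) for t in titles]
print(normalized)
# ['[Serious] one', '[Serious] two', 'three [Serious]', '[Serious] four']
```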
    
    

    13. Matching Years with Regular Expressions

    We can indicate that we're looking for integers in a pattern by using square brackets ("[" and "]"), along with a dash ("-"). For example, "[0-9]" will match any character that falls between 0 and 9 (all of which will be one-digit integers). Similarly, "[a-z]" would match any lowercase letter. We can also specify smaller ranges like "[3-5]" or "[d-g]".

    Repeating "[0-9]" four times would match any four-digit number. That would work, but let's also add the condition that we only want to match years after year 999 and before year 3000 (any other four-digit numbers in a string are probably not years).

    Instructions

    • We've loaded a number of strings into the strings variable for you.
    • Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999.
    • Store every string that contains a year in year_strings. The .append() function will help here.
    import re
    year_strings = []
    for string in strings:
        if re.search('[1-2][0-9][0-9][0-9]', string) is not None:
            year_strings.append(string)
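The exercise's strings variable is preloaded and not shown here, so this sketch uses a made-up `samples` list to show which values the range pattern accepts.

```python
import re

pattern = "[1-2][0-9][0-9][0-9]"
samples = ["Born in 1987", "year 0999", "year 3000", "no digits"]

# The first digit must be 1 or 2, which limits matches to 1000 through 2999.
year_strings = [s for s in samples if re.search(pattern, s) is not None]
print(year_strings)  # ['Born in 1987']
```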
    

    14. Repeating Characters in Regular Expressions

    We can use curly brackets ("{" and "}") to indicate that a pattern should repeat. To match any four-digit number, for example, we could repeat the pattern "[0-9]" four times by writing "[0-9]{4}"

    Instructions

    • We've loaded a number of strings into the strings variable for you.
    • Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999. Use a regex that takes advantage of curly brackets.
    • Store every string that contains a year in year_strings. The .append() function will help here.
    import re
    year_strings = []
    for string in strings:
        if re.search('[1-2][0-9]{3}', string) is not None:
            year_strings.append(string)
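The following sketch compares the long and short forms on made-up strings and points out one caveat. The word-boundary variant at the end is a suggestion that goes beyond the exercise, not part of it.

```python
import re

long_form = "[1-2][0-9][0-9][0-9]"
short_form = "[1-2][0-9]{3}"
samples = ["in 2015", "code 12345", "year 999"]

# Both forms agree on every sample string.
same = all(
    (re.search(long_form, s) is None) == (re.search(short_form, s) is None)
    for s in samples
)
print(same)  # True

# Neither form checks what surrounds the digits, so part of "12345" counts as a year.
loose = re.search(short_form, "code 12345").group()
print(loose)  # 1234

# Word boundaries (\b) are one way to require a standalone four-digit number.
strict = re.search(r"\b[1-2][0-9]{3}\b", "code 12345")
print(strict)  # None
```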
    

    15. Challenge: Extracting All Years

    Finally, let's extract years from a string. The re module contains a findall() function that returns a list of substrings matching the regex. re.findall("[a-z]", "abc123") would return ["a", "b", "c"], because those are the substrings that match the regex.

    Instructions

    • Use re.findall() to generate a list of all years between 1000 and 2999 in the string years_string.
    • Assign the result to years.
    import re
    years = re.findall('[1-2][0-9]{3}', years_string)
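The value of years_string is supplied by the exercise and not shown here, so the sketch below assumes a made-up sentence to show the shape of the result.

```python
import re

# Made-up input; the exercise provides its own years_string.
years_string = "Events in 1999, 2008 and 2015; 3021 and 0987 are out of range."

# findall returns every non-overlapping match, in order, as a list of strings.
years = re.findall("[1-2][0-9]{3}", years_string)
print(years)  # ['1999', '2008', '2015']
```

The results are strings, so converting them with int() is needed before doing arithmetic on the years.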
    

    Article link: https://www.haomeiwen.com/subject/cczpwxtx.html