LeetCode 187 Repeated DNA Sequen

作者: ShuiLocked | 来源:发表于2016-08-29 18:46 被阅读787次

LeetCode 187 Repeated DNA Sequen
2019-01-18
LeetCode 186-190
[位运算]187. Repeated DNA Sequences
Leetcode 187 - Repeated DNA Sequ
Leetcode 187. Repeated DNA Seque
[leetcode]187. Repeated DNA Sequ
[Leetcode] 187. Repeated DNA Seq
Leetcode 187. Repeated DNA Seque
LeetCode187-Repeated DNA Sequenc

LeetCode 187 Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

上来只能想到暴力搜索。。。扫描字符串，对于每一个10bit的substring，check if除了这个substring以外，string的后半部分是否仍然含有这个substring，可以用string.index(str)，若不是-1，则说明存在。

但显然超时了。。。

为了防止每次都搜索一遍substring是否存在，不防将已经出现过的10bit string放到一个hashset中，然后对于往后的substring，都检查是否出现过。这样做的坏处是：存储所有10bit string可能需要消耗过多内存。

看discuss发现了很tricky的hashset+bit manipulation的方法。
bit manipulation真得挺难的，遇到这类题一直没啥思路。。。

对于搜索类的题，如果用hash map，那有一个问题是，再好的hash function，都可能存在collision，那为什么这题用hash，仍然可以避免collision呢？

这就是因为，本题的string只可能有4种字符（ATCG），再结合bit manipulation就可以创造一种不发生冲突的方式！

同时减少memory storage，因为直接存10bit的substring，太浪费内存了！！！

我们希望尽量减少存储所用的bit数量，而因为仅有4种字符，考虑使用两位二进制数，表示这四个字符！！！

0 = 00 (bits in binary number system) = 'A'
1 = 01 (bits in binary number system) = 'C'
2 = 10 (bits in binary number system) = 'G'
3 = 11 (bits in binary number system) = 'T'

因此10bit的substring只需要10*2=20位，就可以表示，而一般的integer都需要4bytes也就是32位。

以"AACCTCCGGT" 为例：
A A C C T C C G G T
00 00 01 01 11 01 01 10 10 11 = 00000101110101101011 (binary) = 23915 (decimal)

用两个hashset分别存储出现过的substring，和已经出现过两次的substring，某substring出现第二次时，将其加入结果list中。

注意这里v就是20bit的string转成的数，用于存储在hashset中！！！
这里运用到左移与或运算，还有0xfffff的mask运算。

Set<Integer> seq = new HashSet<>();
Set<Integer> repeatedSeq = new HashSet<>();

v = (v<<2 | map.get(s.charAt(i))) & 0xfffff; 

if (!seq.add(v) && repeatedSeq.add(v))

代码：

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> seqs = new ArrayList<>();
        Set<Integer> seq = new HashSet<>();
        Set<Integer> repeatedSeq = new HashSet<>();
        HashMap<Character,Integer> map = new HashMap<Character,Integer>();
        map.put('A',0);
        map.put('C',1);
        map.put('G',2);
        map.put('T',3);
        
        int v = 0;
        // Use a sliding window to check every 10-bit substring
        for (int i = 0; i < s.length(); i++) {
            // 2 bits/char * 10 char = 20 bits so use 0xfffff
            v = (v<<2 | map.get(s.charAt(i))) & 0xfffff; 
            if (i < 9) continue;
            // Check each 10-bit substring
            else {
                // If first come out duplicates, then add to list
                if (!seq.add(v) && repeatedSeq.add(v))
                    seqs.add(s.substring(i-9, i+1));
            }
        }
        
        return seqs;
    }
}

参考：
https://discuss.leetcode.com/topic/8894/clean-java-solution-hashmap-bits-manipulation/9

网友评论

本文标题：LeetCode 187 Repeated DNA Sequen

本文链接：https://www.haomeiwen.com/subject/opaxettx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

LeetCode 187 Repeated DNA Sequen

LeetCode 187 Repeated DNA Sequences

相关文章