系统设计题目1-First non repeating word

作者: 小王同学加油 | 来源:发表于2019-04-09 15:55 被阅读0次

系统设计题目1-First non repeating word
[LeetCode]387. First Unique Char
[LeetCode By Go 43]387. First Un
[LeetCode]First Unique Character
codewar练习--1
First Unique Character in a Stri
Amazon-First Unique Character in
387. First Unique Character in a
387. First Unique Character in a
LeetCode387. First Unique Charac

题目： First non repeating word in a file? File size can be 100GB.

Given a text file of size say 100GB? Task is to find first non repeating word in this file?
constraints: You can traverse the file only once.

给你一个文本，里面构成都是单词，寻找到第一步不重复的单词，‘
假设最后一个单词不重复，需要遍历全部。
假设第一个单词不重复，仍然需要遍历全部。

讨论

目标：First non repeating word

不能改变事情：当前字符需要和全部字符比较一遍，才能确定，是否重复，
无论采用什么结构存储，第一次出现肯定是没有，不能决定是唯一的，需要保证后面不重复。

舍去100g字眼，舍去只遍历一次字眼如何做，有几个方法。

直接导入到数据库里，然后用sql计算（最简单）
shell sort FILE | uniq -c | sort -n（最省事）
构建一个map结构直接记录单词出现个数。这就出现2个问题（细节是魔鬼）
问题1 map结构 key是什么单词，还是（单词，频次）
--->key是固定大小
--->.begin() //相当于获取了平衡树最左下面（most left）的结点的迭代器

问题2 如何查看value=1的数值
iterator find( const Key& key )
std::find_if(my_map.begin(), my_map.end(), map_value_finder("English"))
问题3 性能如何

关键词：word

英文单词个数： 200,000 个 20w个
最长英语单词长度：45 英文字母
最大占内存：8M 内存

image.png

类似估计一亿整数 100M左右普通机器足够完成计算任务。

一行记录8字节顶多1G左右

关键词：First non repeating

表示：有序，你想到数据结构是

std:priority_queue
priority_queue<Node,vector<Node>,cmp>q;

缺点：不具备修改数据功能，只有push pop ，因为需要统计单词个数，随时发送变化
不适合频繁修改。

std:map

struct count
{
 int m_counts;
int fist_pos;
}

 map<string, count, greater<string> > name_score_map;

满足：value存储单词个数和位置，
步骤：
1全部插入
2 全部遍历 m_counts=1 and 最小

位图只判断存咋不存在。需要二次遍历舍
std:unorder_map 也是key /value 满足条件。

unordered_map和map都满足用那个好呢？
红黑树并不适应所有应用树的领域:
如果数据基本上是静态的，那么让他们待在他们能够插入，并且不影响平衡的地方会具有更好的性能。如果数据完全是静态的，例如，做一个哈希表，性能可能会更好一些

在实际的系统中，例如，需要使用动态规则的防火墙系统，使用红黑树而不是散列表被实践证明具有更好的伸缩性。Linux内核在管理vm_area_struct时就是采用了红黑树来维护内存块的

顺序有关 list 最前面先插入的，最后面的后插入的

take a hashmap and a double link list. Hashmap will contain the count of word and double LL will maintain order. every map will have address of LL's node's address . whenever we encounter any word twice, we can identify it using hashmap and we will delete it's corresponding node from LL. at the end, we will return first node of LL.

汇总：优化

1 将一个大文件分割成n个小文件
2 一个文件一个线程处理
一次读取每个文件单词，存储到unordered_map结构中，unordered_map预设大小
std::unordered_map::reserve
3 统计每个线程count=1的数据并且合并到最终vecotr结构中（局部最优）
4 排序规则：偏移量统计唯一出现并且最前面（整体最优）

网友评论

本文标题：系统设计题目1-First non repeating word

本文链接：https://www.haomeiwen.com/subject/bimpiqtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

系统设计题目1-First non repeating word

题目： First non repeating word in a file? File size can be 100GB.

讨论

舍去100g字眼，舍去只遍历一次字眼如何做，有几个方法。

关键词：First non repeating

相关文章