统计一段话内单词出现的次数,最简单的做法为 split 后进行 hash 统计,而经典的答案则是用 trie 数据结构来统计词频。即使目前js引擎的突飞猛进,这种经典的算法仍是必要的,以下简单构造了一个性能测试来展示下:
trie 实现文件 split hash 实现文件 benchmark
数据
测试数据为 wiki 的第一段话重复指定次数后形成的字符串:
In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.
结果
测试结果如下:
running… string length:778
split&hash count x 10,568 ops/sec ±1.74% (59 runs sampled)
trie count x 12,514 ops/sec ±1.30% (92 runs sampled)
fastest is trie count
running… string length:1556
split&hash count x 6,039 ops/sec ±3.51% (92 runs sampled)
trie count x 8,020 ops/sec ±3.60% (45 runs sampled)
fastest is trie count
running… string length:7780
split&hash count x 1,407 ops/sec ±2.36% (92 runs sampled)
trie count x 2,589 ops/sec ±1.11% (18 runs sampled)
fastest is trie count
running… string length:77800
split&hash count x 148 ops/sec ±0.93% (86 runs sampled)
trie count x 281 ops/sec ±1.59% (86 runs sampled)
fastest is trie count
running… string length:778000
split&hash count x 13.09 ops/sec ±5.15% (35 runs sampled)
trie count x 27.40 ops/sec ±3.64% (43 runs sampled)
fastest is trie count
可见当数据比较小时, trie 数据结构的优势还不明显,当数据量越来越大时优势就很明显了,不过极限也不到三倍,也体现了 v8 引擎的优化,另外也和测试数据单词多样性也不够有关。
赞一个!最近有空吗?我在QQ上找你,好像不在呀,好久没联系,有时间聊一下。