用 trie 统计词频的 benchmark

统计一段话内单词出现的次数，最简单的做法为 split 后进行 hash 统计，而经典的答案则是用 trie 数据结构来统计词频。即使目前js引擎的突飞猛进，这种经典的算法仍是必要的，以下简单构造了一个性能测试来展示下：

数据

测试数据为 wiki 的第一段话重复指定次数后形成的字符串：

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

结果

测试结果如下：

running… string length:778

split&hash count x 10,568 ops/sec ±1.74% (59 runs sampled)

trie count x 12,514 ops/sec ±1.30% (92 runs sampled)

fastest is trie count

running… string length:1556

split&hash count x 6,039 ops/sec ±3.51% (92 runs sampled)

trie count x 8,020 ops/sec ±3.60% (45 runs sampled)

fastest is trie count

running… string length:7780

split&hash count x 1,407 ops/sec ±2.36% (92 runs sampled)

trie count x 2,589 ops/sec ±1.11% (18 runs sampled)

fastest is trie count

running… string length:77800

split&hash count x 148 ops/sec ±0.93% (86 runs sampled)

trie count x 281 ops/sec ±1.59% (86 runs sampled)

fastest is trie count

running… string length:778000

split&hash count x 13.09 ops/sec ±5.15% (35 runs sampled)

trie count x 27.40 ops/sec ±3.64% (43 runs sampled)

fastest is trie count

可见当数据比较小时， trie 数据结构的优势还不明显，当数据量越来越大时优势就很明显了，不过极限也不到三倍，也体现了 v8 引擎的优化，另外也和测试数据单词多样性也不够有关。

try

try it yourself

	yiminghe发表在《期待你的加入/We’re hiring》
	guzhongren发表在《使用 rml(React Markup Language)…》
	guzhongren发表在《期待你的加入/We’re hiring》
	haishu发表在《f8 2017 见闻》
	OneNewLife发表在《我的 React 2015 之路》

{focus: web}

用 trie 统计词频的 benchmark

数据

结果

try

《用 trie 统计词频的 benchmark》有1条评论

留下评论取消回复

用 trie 统计词频的 benchmark

数据

结果

try

分享到：

相关

《用 trie 统计词频的 benchmark》有1条评论

留下评论 取消回复

留下评论取消回复