用 trie 统计词频的 benchmark

统计一段话内单词出现的次数,最简单的做法为 split 后进行 hash 统计,而经典的答案则是用 trie 数据结构来统计词频。即使目前js引擎的突飞猛进,这种经典的算法仍是必要的,以下简单构造了一个性能测试来展示下:

trie 实现文件   split hash 实现文件   benchmark

数据

测试数据为 wiki 的第一段话重复指定次数后形成的字符串:

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

结果

测试结果如下:

running… string length:778

split&hash count x 10,568 ops/sec ±1.74% (59 runs sampled)

trie count x 12,514 ops/sec ±1.30% (92 runs sampled)

fastest is trie count

running… string length:1556

split&hash count x 6,039 ops/sec ±3.51% (92 runs sampled)

trie count x 8,020 ops/sec ±3.60% (45 runs sampled)

fastest is trie count

running… string length:7780

split&hash count x 1,407 ops/sec ±2.36% (92 runs sampled)

trie count x 2,589 ops/sec ±1.11% (18 runs sampled)

fastest is trie count

running… string length:77800

split&hash count x 148 ops/sec ±0.93% (86 runs sampled)

trie count x 281 ops/sec ±1.59% (86 runs sampled)

fastest is trie count

running… string length:778000

split&hash count x 13.09 ops/sec ±5.15% (35 runs sampled)

trie count x 27.40 ops/sec ±3.64% (43 runs sampled)

fastest is trie count

可见当数据比较小时, trie 数据结构的优势还不明显,当数据量越来越大时优势就很明显了,不过极限也不到三倍,也体现了 v8 引擎的优化,另外也和测试数据单词多样性也不够有关。

try

try it yourself

用 trie 统计词频的 benchmark”的一个响应

发表评论

Fill in your details below or click an icon to log in:

WordPress.com 徽标

You are commenting using your WordPress.com account. Log Out /  更改 )

Google photo

You are commenting using your Google account. Log Out /  更改 )

Twitter picture

You are commenting using your Twitter account. Log Out /  更改 )

Facebook photo

You are commenting using your Facebook account. Log Out /  更改 )

Connecting to %s