PHP Classes

Question about performance

Recommend this page to a friend!

      Fuzzy Index  >  All threads  >  Question about performance  >  (Un) Subscribe thread alerts  
Subject:Question about performance
Summary:Benchmarking 3.000.000 items
Messages:2
Author:Richard Keizer
Date:2012-06-17 17:00:25
Update:2012-06-17 19:34:44
 

  1. Question about performance   Reply   Report abuse  
Picture of Richard Keizer Richard Keizer - 2012-06-17 17:00:25
Say I have a database containing 3.000.000 names (max. length 255 chars).
How will the algorithm perform? Have your ran any benchmarks on this?

Thanks for the great contribution!

  2. Re: Question about performance   Reply   Report abuse  
Picture of Philipp Strazny Philipp Strazny - 2012-06-17 19:34:44 - In reply to message 1 from Richard Keizer
Hi Richard,

thank you for the question.

For testing, I ran book-length texts such as Königliche Hoheit, Thomas Mann, txt version, downloaded from gutenberg.org.

The text contains ca. 8500 lines with 100.000 words and 750.000 characters. The following table shows how long it takes to load all strings into the DB, what size the DB is afterwards, and how long it takes to look up 8 different slightly modified items with their scores (on a laptop with 1.8GHz processor and 2GB RAM):

CharsHeuristic:
loadtime 112 s
dbsize: 20 MB
lookup 8 items: 1.7 s
avg score: 95.25%

WordHeuristic:
loadtime: 27 s
dbsize: 6.2 MB
lookup 8 items: 0.1 s
avg score: 93%

The other heuristics fall somewhere in between.

I have not run tests with very large collections as you suggest, because my guess is that searching (and loading) an index for 3 million items of 255 characters would be way too slow to be practical. Depending on the heuristic you use, the number of items in the database would increase dramatically. For example, CharsHeuristic could store up to 750 items for a single 255 character string - this all depends on what kind of data you are loading.