I’m currently working on a project involving a large amount of database processing on the backend. The last week has been dedicated to getting a script running to take the crappy data from multiple DBs, process it into something nice, and inserting it into my working database. The original stab at the code with no attempt to speed it up wound up taking about 4 seconds per processed item. This particular database had about 170,000 items to process, so the import script would take something near 189 hours. I broke out the profiler and found over 99% of the time was wasted while waiting for mysql select queries to come back. My initial round of changes helped immensely (from 4.0 to 0.25 seconds per item), but we were still at almost 12 hours to run. That’s when I ported the code to python and started testing threads.
Before I get into it, let’s make this crystal clear: THIS IS NOT GREAT TEST DATA. DRAW NO CONCLUSIONS FROM THIS. This was not a robust test in any way. I continued to make time saving changes and bug fixes as they were discovered, so even the time between runs can’t be relied upon. Nor was this on an isolated server — this was done on a production server currently hosting dozens of websites with unknown traffic and load. Even the test numbers are too small to make real assessments of performance. Finally, this applies ONLY to my precise script and the algorithm I’m using. You will get different results with a different project. It would be foolish to take anything concrete from these numbers.
That being said, the results are still interesting to consider and they should encourage you to do your own testing when building your next database heavy project.
Up to this point I had been looking at the time/processed item, so I started there after throwing together the threaded importer. Here are the results of running the script with a set number of entries for each thread to process with varying numbers of threads.
| Seconds Per Processed Item | ||
|---|---|---|
| Threads | 100 Items per Thread (Total) | 250 Items per Thread (Total) |
| 1 | .118 (100) | .106 (100) |
| 2 | .097 (200) | .106 (250) |
| 3 | .095 (300) | .110 (500) |
| 4 | .102 (400) | .114 (750) |
| 5 | .106 (500) | .115 (1250) |
| 8 | .112 (800) | |
| 10 | .116 (1000) | |
| 12 | .120 (1200) | |
| 15 | .120 (1500) | |
Not at all what I expected, but with the vastly different number of total items being processed, I decided to take another approach. Here is the same script, but this time with a set TOTAL number of entries to process (5000). In other words, the above data did X items per thread, so the total number of processed items was threads * X. Now, I’m assigning each thread 5000/threads entries. This makes much more sense for the project (the first test was plain stupid) and, as you can see, the results are more what you would expect.
| 5000 Items Evenly Across All Threads | ||
|---|---|---|
| Threads | Total Time | Time per Item |
| 1 | 08:17.21 | 0.0994 |
| 2 | 07:13.33 | 0.0867 |
| 3 | 06:41.97 | 0.0804 |
| 4 | 06:32.97 | 0.0786 |
| 5 | 06:27.36 | 0.0775 |
| 8 | 06:23.59 | 0.0767 |
| 10 | 06:24.21 | 0.0768 |
| 12 | 06:24.24 | 0.0769 |
| 15 | 06:23.25 | 0.0767 |
Now you can see that the threading is making a sizeable difference – almost 23% at its best. An interesting thing to note is that, for this script, the benefits of threading drop off at about 8. Obviously at that point the threads just get in each other’s way instead of keeping the database fully loaded. Overall, the lessons to be learned here are 1. test your project in multiple different ways and 2. threading will likely help out if you’re wasting the majority of your time waiting for database selections (as opposed to processing data).
Recent Comments