Indexing Legacy Data with NHibernate.Search
While configuring NHibernate.Search, I ran into a problem batch-indexing a million or so legacy records. When I built the index directly with Lucene.Net, indexing was fast and worked as expected. When I built it through NHibernate.Search, the indexer generated far too many index files, numbering in the hundreds of thousands. The number of file operations grew drastically with each iteration of the indexer, to the point that the FullTextSession.Index call would never finish.
I spent a long time experimenting with different merge factors and max-file parameters for Lucene.Net, but I never got them to behave as I expected. The solution turned out to be forcing an optimize on the index after a certain number of records. Optimizing a Lucene index is analogous to defragmenting a hard drive: it merges the thousands of splintered .cfs files into one big file, which removes the need to scan a growing number of files before each write.
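To make the "force an optimize" step concrete, here is a minimal sketch of what such a call might look like through NHibernate.Search. It assumes that IFullTextSession exposes a SearchFactory property whose ISearchFactory offers an Optimize(Type) method, as in the Hibernate Search API the library was ported from; verify this against the NHibernate.Search version you are using. The class and method names (IndexOptimizer, OptimizeIndexFor) are hypothetical and are not the helper my own code uses.

```csharp
using NHibernate;
using NHibernate.Search;

// Hypothetical helper, for illustration only.
public static class IndexOptimizer
{
    // Merges the index segments for entity type T into a compact form.
    // ASSUMPTION: ISearchFactory.Optimize(Type) exists in your
    // NHibernate.Search build; check the API before relying on it.
    public static void OptimizeIndexFor<T>(ISession session)
    {
        IFullTextSession fullTextSession = Search.CreateFullTextSession(session);
        fullTextSession.SearchFactory.Optimize(typeof(T));
    }
}
```

Calling something like this every few thousand records keeps the file count bounded, at the cost of a pause while the segments are merged.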
Here is my generic CreateIndex method, which includes periodic optimization. It solved the problem and let me index 1.5 million legacy records in about 3 hours. The code depends on a specific finder implementation and on a generic method for optimizing an index, neither of which is shown, but it should be enough to get the idea across.
    using System;
    using System.Linq;
    using System.Reflection;
    using NHibernate.Search;

    public static void CreateIndex<T>(int batchSize)
    {
        Type type = typeof(T);

        // Get the query object for the type to be indexed
        object finder = Find.Factory.ResolveFinderFor(type);
        var method = finder.GetType().GetMethod("get_All");
        var objectQuery = method.Invoke(finder, null) as IQueryable<T>;

        IFullTextSession fullTextSession = Search.CreateFullTextSession(NH.CurrentSession);

        var total = objectQuery.Count();
        // Round up so that a final partial batch is still indexed
        var iterations = (total + batchSize - 1) / batchSize;

        const int optimizeThreshold = 10000;
        var optimizeThresholdCounter = 0;

        // Find the generic optimize method
        MethodInfo optimizeMethod = typeof(IndexHelper).GetMethod("OptimizeIndex");
        // Make it generic for the type in question
        MethodInfo genericOptimizeMethod = optimizeMethod.MakeGenericMethod(type);

        for (var i = 0; i < iterations; i++)
        {
            // Load one page of records
            var subset = objectQuery.Skip(i * batchSize).Take(batchSize).ToList();
            optimizeThresholdCounter += batchSize;

            // Index the page inside a single transaction
            var tx = fullTextSession.BeginTransaction();
            foreach (T instance in subset)
            {
                fullTextSession.Index(instance);
            }
            tx.Commit();

            // Flush and clear so the session does not keep growing
            fullTextSession.Flush();
            fullTextSession.Clear();

            // If we've hit the threshold, optimize
            if (optimizeThreshold != 0 && optimizeThresholdCounter >= optimizeThreshold)
            {
                genericOptimizeMethod.Invoke(null, null);
                optimizeThresholdCounter = 0;
            }
        }

        // Optimize the index one final time
        genericOptimizeMethod.Invoke(null, null);
    }
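A paging loop like the one above has to cover a final partial batch whenever the record count is not an exact multiple of the batch size. The standalone sketch below shows only that arithmetic, with no NHibernate dependency. BatchPlanner, BatchCount and Plan are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class BatchPlanner
{
    // Number of batches needed to cover `total` records, rounding up.
    public static int BatchCount(int total, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        if (total < 0) throw new ArgumentOutOfRangeException("total");
        return (total + batchSize - 1) / batchSize;
    }

    // Yields one (skip, take) pair per batch; the last pair may be short.
    public static IEnumerable<KeyValuePair<int, int>> Plan(int total, int batchSize)
    {
        int batches = BatchCount(total, batchSize);
        for (int i = 0; i < batches; i++)
        {
            int skip = i * batchSize;
            int take = Math.Min(batchSize, total - skip);
            yield return new KeyValuePair<int, int>(skip, take);
        }
    }
}
```

For 1,500,500 records and a batch size of 1,000, BatchCount returns 1501 and the last pair from Plan is (1500000, 500). Plain integer division would give 1500 and leave the last 500 records unindexed.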
I hope this saves someone some headaches - I know I wasted a lot of time finding this solution.

January 29th, 2010 - 05:32
Cheers, good job! You inspired me when I had the same problem – and I published a slightly different implementation of it on my site at http://www.stewartwhiting.com/wp/?p=55
Cheers,
Stewart.