Brandon M. West

21 Dec 2009

Indexing Legacy Data with NHibernate.Search

While configuring NHibernate.Search, I ran into a problem batch processing a million or so legacy records. When I created the index directly with Lucene.Net, indexing was fast and worked as expected. When I created it via NHibernate.Search, the indexer generated far too many index files, numbering in the hundreds of thousands. As a result, the number of file operations grew drastically with each iteration of the indexer, to the point that the FullTextSession.Index calls would never finish.

I spent a long time experimenting with different merge factors and max file parameters for Lucene.Net, but I was never able to make it work as I expected. The solution ended up being to force an optimize on the index after a certain number of records. Optimizing a Lucene index is analogous to defragmenting a hard drive: it merges the thousands of splintered .cfs files into one big file, so the indexer no longer has to scan a growing number of files before each write.
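For reference, forcing an optimize directly through Lucene.Net might look roughly like the sketch below. It is not the IndexHelper.OptimizeIndex method used later in this post. It assumes the Lucene.Net 3.0.3 API (older 2.x releases use Close() rather than Dispose()), and the class name, method name, and indexPath parameter are hypothetical. It also assumes no other writer holds the index lock when it runs.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper, assuming Lucene.Net 3.0.3.
public static class IndexMaintenanceSketch
{
    // Merges all segments of the index stored at indexPath into one.
    public static void OptimizeIndexAt(string indexPath)
    {
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        {
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);

            // create: false opens the existing index instead of overwriting it.
            using (var writer = new IndexWriter(
                directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                writer.Optimize();
            }
        }
    }
}
```

Optimizing rewrites the whole index, so it is expensive in its own right. That is why it makes sense to run it every few thousand records rather than after every batch.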

Here is my generic CreateIndex method that includes periodic optimization. This ended up solving the problem and allowed me to index 1.5 million legacy records in about 3 hours. This code depends on a specific finder implementation, as well as a generic method for optimizing an index, but it should be enough to get the idea across.

using System;
using System.Linq;
using System.Reflection;
using NHibernate.Search;

public static void CreateIndex<T>(int batchSize)
{
    Type type = typeof(T);

    // Get the query object for the type to be indexed
    object finder = Find.Factory.ResolveFinderFor(type);
    var method = finder.GetType().GetMethod("get_All");
    var objectQuery = method.Invoke(finder, null) as IQueryable<T>;

    IFullTextSession fullTextSession =
        Search.CreateFullTextSession(NH.CurrentSession);

    var total = objectQuery.Count();

    // Round up so a final partial batch is not skipped
    var iterations = (total + batchSize - 1) / batchSize;

    const int optimizeThreshold = 10000;
    var optimizeThresholdCounter = 0;

    // Find the generic optimize method
    MethodInfo optimizeMethod =
        typeof(IndexHelper).GetMethod("OptimizeIndex");

    // Make it generic for the type in question
    MethodInfo genericOptimizeMethod =
        optimizeMethod.MakeGenericMethod(type);

    for (var i = 0; i < iterations; i++)
    {
        // Paging with Skip/Take assumes the query returns rows in a stable order
        var subset = objectQuery.Skip(i * batchSize).Take(batchSize).ToList();

        optimizeThresholdCounter += subset.Count;

        // Index one batch per transaction
        using (var tx = fullTextSession.BeginTransaction())
        {
            foreach (T instance in subset)
            {
                fullTextSession.Index(instance);
            }
            tx.Commit();
        }

        // Release the indexed entities so the session does not grow unbounded
        fullTextSession.Flush();
        fullTextSession.Clear();

        // If we've hit the threshold, optimize
        if (optimizeThreshold != 0 &&
            optimizeThresholdCounter >= optimizeThreshold)
        {
            genericOptimizeMethod.Invoke(null, null);
            optimizeThresholdCounter = 0;
        }
    }

    // Optimize the index one final time
    genericOptimizeMethod.Invoke(null, null);
}
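The batching arithmetic is easy to get subtly wrong when the record count is not an exact multiple of the batch size, so here is a small standalone sketch of just that part, with no NHibernate or Lucene dependency. The BatchPlan class and its method names are hypothetical and exist only for illustration. It computes how many batches are needed and after which batches an optimize would be triggered.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that models only the batch/optimize bookkeeping.
public static class BatchPlan
{
    // Ceiling division, so a final partial batch is still counted.
    public static int BatchCount(int total, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        return (total + batchSize - 1) / batchSize;
    }

    // Returns the 0-based batch indexes after which an optimize would run.
    // A threshold of 0 disables periodic optimization.
    public static List<int> OptimizePoints(int total, int batchSize, int optimizeThreshold)
    {
        var points = new List<int>();
        var sinceLastOptimize = 0;
        var batches = BatchCount(total, batchSize);

        for (var i = 0; i < batches; i++)
        {
            // The last batch may hold fewer than batchSize records.
            var size = Math.Min(batchSize, total - i * batchSize);
            sinceLastOptimize += size;

            if (optimizeThreshold > 0 && sinceLastOptimize >= optimizeThreshold)
            {
                points.Add(i);
                sinceLastOptimize = 0;
            }
        }
        return points;
    }
}
```

For example, BatchPlan.BatchCount(25, 10) gives 3 batches rather than 2, and BatchPlan.OptimizePoints(25000, 1000, 10000) reports an optimize after batches 9 and 19, with the final optimize left to the end of the run.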

I hope this saves someone some headaches - I know I wasted a lot of time finding this solution.


Comments (1) Trackbacks (0)
  1. Cheers, good job! You inspired me when I had the same problem – and I published a slightly different implementation of it on my site at http://www.stewartwhiting.com/wp/?p=55

    Cheers,
    Stewart.

