Indexing documents: how to set up?


I'm new to Lucene and am trying to set up an advanced search on a DNN 07.02.02 host. I have the latest version of LuceneSearch; the install went fine and searching standard content works correctly.

I now want to enable indexing of PDF and Office (xlsx, docx) files, but I am getting errors and do not know how to proceed.
I installed Adobe IFilter v.6 (http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611), and also the 64-bit version of the Microsoft filters for Office 2007 (http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en).
I then got an error that a DLL was missing (bcprov-jdk15-1.44), so I found a DLL package at http://www.squarepdf.net/pdfbox-in-net (version 1.8.4, also tried with v. 1.7), but I now get this error when I launch Lucene indexing:
2014-07-23 16:28:39,531 [JPWEB2011][Thread:26][ERROR] DotNetNuke.Services.Exceptions.Exceptions - FriendlyMessage="Error:  is currently unavailable." ctrl="ASP.desktopmodules_aricie_lucenesearch_controls_settings_indexersettings_ascx" exc="System.IO.FileLoadException: Could not load file or assembly 'IKVM.OpenJDK.Core, Version=, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)
File name: 'IKVM.OpenJDK.Core, Version=, Culture=neutral, PublicKeyToken=13235d27fcbfff58'
   at Aricie.Documents.Services.Pdf.PdfBoxReader.GetDocumentText(String fileName)
   at Aricie.Documents.Services.DocumentFile.ReadDocumentText(Int32 maxLengthChars)
   at Aricie.DNN.Modules.LuceneSearch.Business.DocumentController.GetFieldsFromFile(String documentPath, Boolean store)
   at Aricie.DNN.Modules.LuceneSearch.Business.DocumentController.GetDocumentFromFile(String documentPath, Boolean store)
   at Aricie.DNN.Modules.LuceneSearch.ModuleProviders.StandAloneDocumentProvider.UpgradeSearchItem(Int32 portalId, LuceneSearchItemInfo& searchItem)
   at Aricie.DNN.Modules.LuceneSearch.Business.LuceneIndexer.UpgradeSearchItems(Int32 portalID, SearchItemInfoCollection nativeItems, LucenePortalSettings settings, ExistingIndexStructure`1 existingIndex)

I tried replacing every existing DLL of the same name with the new version from the PDFBox package, but that did not solve the problem.
Can anybody help?

Thank you,
Closed Aug 7, 2014 at 7:41 AM by Stephane_TETARD


trapias wrote Jul 23, 2014 at 3:28 PM

Update: I tried replacing the DLLs with version 0.46 of IKVM, but I continue to get errors.
Which IKVM version should I use with this version of LuceneSearch? With 0.46 in place, I now get an error saying it needs a newer version:
2014-07-23 17:27:24,611 [JPWEB2011][Thread:17][ERROR] DotNetNuke.Services.Exceptions.Exceptions - System.Exception:  Détail : c:\inetpub\wwwroot\stage\Portals\0\Repository\2012.4b46c489-6f62-4295-b284-3f5554ea8010.pdf ---> System.IO.FileLoadException: Could not load file or assembly 'IKVM.Runtime, Version=7.2.4630.5, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)
   at System.ModuleHandle.ResolveType(RuntimeModule module, Int32 typeToken, IntPtr* typeInstArgs, Int32 typeInstCount, IntPtr* methodInstArgs, Int32 methodInstCount, ObjectHandleOnStack type)
   at System.ModuleHandle.ResolveTypeHandleInternal(RuntimeModule module, Int32 typeToken, RuntimeTypeHandle[] typeInstantiationContext, RuntimeTypeHandle[] methodInstantiationContext)
   at System.ModuleHandle.ResolveTypeHandle(Int32 typeToken, RuntimeTypeHandle[] typeInstantiationContext, RuntimeTypeHandle[] methodInstantiationContext)
   at System.Reflection.RuntimeModule.ResolveType(Int32 metadataToken, Type[] genericTypeArguments, Type[] genericMethodArguments)
   at System.Reflection.CustomAttribute.FilterCustomAttributeRecord(CustomAttributeRecord caRecord, MetadataImport scope, Assembly& lastAptcaOkAssembly, RuntimeModule decoratedModule, MetadataToken decoratedToken, RuntimeType attributeFilterType, Boolean mustBeInheritable, Object[] attributes, IList derivedAttributes, RuntimeType& attributeType, IRuntimeMethodInfo& ctor, Boolean& ctorHasParameters, Boolean& isVarArg)
   at System.Reflection.CustomAttribute.GetCustomAttributes(RuntimeModule decoratedModule, Int32 decoratedMetadataToken, Int32 pcaCount, RuntimeType attributeFilterType, Boolean mustBeInheritable, IList derivedAttributes, Boolean isDecoratedTargetSecurityTransparent)
   at System.Reflection.CustomAttribute.GetCustomAttributes(RuntimeModule module, RuntimeType caType)
   at IKVM.Internal.AssemblyClassLoader.AssemblyLoader..ctor(Assembly assembly)
   at IKVM.Internal.AssemblyClassLoader.Create(Assembly assembly)
   at IKVM.Internal.AssemblyClassLoader.FromAssembly(Assembly assembly)
   at IKVM.Internal.AssemblyClassLoader.LoadReferenced(String name)
   at IKVM.NativeCode.ikvm.runtime.AssemblyClassLoader.LoadClass(Object classLoader, Assembly assembly, String name)
   at ikvm.runtime.AssemblyClassLoader.loadClass(String name, Boolean resolve)
   at java.lang.ClassLoader.loadClass(String name)
   at org.apache.pdfbox.util.TextNormalize.findICU4J()
   at org.apache.pdfbox.util.TextNormalize..ctor(String encoding)
   at org.apache.pdfbox.util.PDFTextStripper..ctor()
   at Aricie.Documents.Services.Pdf.PdfBoxReader.GetDocumentText(String fileName)
   --- End of inner exception stack trace ---
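
For anyone comparing the two traces above, the assembly name and version that the runtime asked for can be pulled out of the exception text and checked against the DLLs actually present. A minimal Python sketch, assuming the message follows the "Could not load file or assembly '...'" wording shown in these logs; the function name parse_assembly_reference is made up for illustration and is not part of LuceneSearch or IKVM:

```python
import re

# Matches the quoted assembly display name in a .NET FileLoadException message.
ASSEMBLY_REF = re.compile(r"Could not load file or assembly '([^']+)'")


def parse_assembly_reference(message):
    """Return the fields of the first assembly reference in `message`, or None.

    Hypothetical helper: it only splits the display name on commas and '='.
    """
    match = ASSEMBLY_REF.search(message)
    if match is None:
        return None
    name, *attributes = [part.strip() for part in match.group(1).split(",")]
    fields = {"Name": name}
    for attribute in attributes:
        key, _, value = attribute.partition("=")
        fields[key.strip()] = value.strip()
    return fields


# Message text copied from the second log entry in this thread.
sample = (
    "System.IO.FileLoadException: Could not load file or assembly "
    "'IKVM.Runtime, Version=7.2.4630.5, Culture=neutral, "
    "PublicKeyToken=13235d27fcbfff58' or one of its dependencies."
)
print(parse_assembly_reference(sample))
```

The Version field is the one to compare with the file version of the DLL in the site's bin folder. An empty Version, as in the first log entry, comes back as an empty string.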

Aricie wrote Jul 24, 2014 at 10:20 AM

Hi, it may be that the last package was incomplete.
I will look into publishing an updated version of the package together with a new Aricie.Shared Package.
I'll let you know when I'm done.

Stephane_TETARD wrote Aug 4, 2014 at 12:41 PM

We just released a new package for Aricie Shared, which contains the DLL that was missing for PDF indexing: bcprov-jdk14-132.dll.
I suggest removing all the PDFBox and IKVM DLLs you uploaded to your website. You can then install this new version of Aricie Shared and repair the LuceneSearch install with the original package, so that you end up with the full set of correct DLLs.
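
To see which files the cleanup step would affect before deleting anything, the folder can be listed first. A rough Python sketch, assuming the uploaded DLLs sit in a single folder and that their names start with "ikvm" or "pdfbox"; both the patterns and the function name find_candidate_dlls are assumptions for illustration, not something the Aricie packages provide. It only lists files and removes nothing:

```python
import fnmatch
from pathlib import Path

# Assumed file-name patterns; extend them if other DLLs were uploaded by hand.
PATTERNS = ("ikvm*.dll", "pdfbox*.dll")


def find_candidate_dlls(bin_dir, patterns=PATTERNS):
    """Return the names of files in `bin_dir` matching any pattern, ignoring case."""
    files = sorted(Path(bin_dir).iterdir(), key=lambda p: p.name.lower())
    return [
        path.name
        for path in files
        if path.is_file()
        and any(fnmatch.fnmatch(path.name.lower(), pattern) for pattern in patterns)
    ]
```

Calling find_candidate_dlls with the path to the site's bin folder returns the matching file names, which can be reviewed by hand before removing them and installing the new Aricie Shared package.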

trapias wrote Aug 5, 2014 at 3:34 PM

I installed the new shared libraries v. 1.8.00, and I confirm it solves the problem for me - no more errors when indexing, thank you!
