SharePoint 2007 and ZIP indexing


Introduction

This post covers indexing ZIP archives, in the same style as my earlier post on PDF indexing. The search engine uses IFilters to read the structure of a specific file type and extract information from it into an index. When you run a search query, the results come from that index. Without IFilters you could only search on file name and metadata.

[Indexing Server]: the server(s) in the SharePoint Farm that has/have the "Indexing" Role assigned. In a small farm this can be a single server for all roles.

[Web Front End Server]: the server(s) in the SharePoint Farm that has/have the "Web Front End" Role assigned. In a small farm this can be a single server for all roles.

Windows SharePoint Services 3.0

[Indexing Server]

  1. Install the ZIP IFilter (see below for a list of available IFilters)
  2. Add the .zip file type to the index list:
    1. Open the Registry Editor (Start > Run > regedit)
    2. Go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\<GUID>\Gather\Search\Extensions\ExtensionList
    3. Add a new String Value
      1. Value name: <next value in line>
      2. Value data: zip
  3. Perform an iisreset
  4. Perform a Full Update on the Search content indexes
    1. Open a Command Prompt on the Indexing Server
    2. net stop spsearch
    3. net start spsearch
    4. cd "C:\Program Files\Common Files\Microsoft Shared\Web server extensions\12\BIN"
    5. stsadm.exe -o spsearch -action fullcrawlstop
    6. stsadm.exe -o spsearch -action fullcrawlstart
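
The registry part of step 2 can also be scripted. What follows is a minimal Python sketch, not a tested deployment script. It assumes the existing values under ExtensionList are named with consecutive numbers ("1", "2", ...), which is how I read "next value in line" above. The function names next_value_name and add_extension are made up for this illustration, and the registry write only runs on Windows (it uses the standard winreg module).

```python
import re


def next_value_name(existing_names):
    """Return the next numeric value name, assuming names look like "1", "2", ..."""
    numbers = [int(name) for name in existing_names if re.fullmatch(r"\d+", name)]
    return str(max(numbers, default=0) + 1)


def add_extension(key_path, extension="zip"):
    """Add `extension` as a new String Value under the given ExtensionList key.

    Windows only: winreg is imported here so next_value_name stays usable elsewhere.
    Returns the new value name, or None if the extension is already listed.
    """
    import winreg

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, access) as key:
        value_count = winreg.QueryInfoKey(key)[1]
        # EnumValue returns (name, data, type); keep name -> data.
        values = dict(winreg.EnumValue(key, i)[:2] for i in range(value_count))
        if extension.lower() in (str(data).lower() for data in values.values()):
            return None
        name = next_value_name(values)
        winreg.SetValueEx(key, name, 0, winreg.REG_SZ, extension)
        return name
```

To use it, pass the ExtensionList path from step 2.2 without the HKEY_LOCAL_MACHINE prefix and with <GUID> replaced by the value from your own farm, from an elevated prompt on the Indexing Server. Steps 3 and 4 still apply afterwards.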

[Web Front End Server]

The zip icon registration is available out of the box.

Microsoft Office SharePoint Server 2007

[Indexing Server]

  1. Install the ZIP IFilter (see below for a list of available IFilters)
  2. Add the .zip file type to the index list:
    1. Go to Central Administration, then to the Shared Services Administration Web of the current SSP, go to Search Settings and then to File Types
    2. Add a new file type zip
  3. Perform an iisreset
  4. Perform a Full Update on the Search content indexes
    1. Open a Command Prompt on the Indexing Server
    2. net stop osearch
    3. net start osearch
    4. Go to Central Administration, then to the Shared Services Administration Web of the current SSP, go to Search Settings and start a full crawl of all locations containing ZIP files

[Web Front End Server]

The zip icon registration is available out of the box.

Available IFilters

IFilterShop ZIP IFilter

  • requires a license
  • 32 bit and 64 bit (applies to the [Indexing Server])
  • Note: I haven't gotten this one to work. After installation and configuration I'm receiving the following for all crawled ZIP items: Crawled (The filtering process could not load the item. This is possibly caused by an unrecognized item format or item corruption. )

Citeknet ZIP IFilter

  • requires a license
  • 32 bit and 64 bit (applies to the [Indexing Server])
  • Currently version 2.1 Beta
  • Works very well in the test setup. I haven't seen it in production or under stress tests.

What about PDF documents inside ZIP archives?

The ZIP IFilter will index all files in the archive using the corresponding IFilter for each file type, but if yours is an apartment-threaded IFilter (such as Adobe's PDF IFilter) you need to make the following adjustment:

[Indexing Server]

  1. Open the Registry Editor (Start > Run > regedit)
  2. Go to HKEY_CLASSES_ROOT\CLSID\{4C904448-74A9-11d0-AF6E-00C04FD8DC02}\InprocServer32
  3. Change the ThreadingModel key value
    1. Old value: Apartment
    2. New value: Both
  4. Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex
  5. Change the DLLsToRegister key value
    1. Remove the entry corresponding to pdffilt.dll from the list to prevent the Adobe PDF IFilter from re-registering
  6. Restart the Search Service and perform a Full Update
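
Steps 2 to 5 can be sketched in Python as well. This is an illustration under assumptions, not a verified script: it assumes DLLsToRegister is a multi-string value (which winreg returns as a list of strings), and the function names without_dll, set_threading_model and unregister_dll are invented for this example. The CLSID is the one from step 2 and may differ for other versions of the PDF IFilter. Export both keys before changing anything.

```python
import ntpath

PDF_FILTER_KEY = r"CLSID\{4C904448-74A9-11d0-AF6E-00C04FD8DC02}\InprocServer32"
CONTENT_INDEX_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex"


def without_dll(entries, dll_name="pdffilt.dll"):
    """Return the entries whose file name is not `dll_name` (case-insensitive)."""
    target = dll_name.lower()
    return [e for e in entries if ntpath.basename(e.strip()).lower() != target]


def set_threading_model(model="Both"):
    """Steps 2-3: change ThreadingModel for the PDF IFilter (Windows only)."""
    import winreg

    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, PDF_FILTER_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "ThreadingModel", 0, winreg.REG_SZ, model)


def unregister_dll(dll_name="pdffilt.dll"):
    """Steps 4-5: drop the DLL from DLLsToRegister (Windows only)."""
    import winreg

    access = winreg.KEY_READ | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CONTENT_INDEX_KEY, 0, access) as key:
        entries, value_type = winreg.QueryValueEx(key, "DLLsToRegister")
        # Write the filtered list back with the same value type it was read with.
        winreg.SetValueEx(key, "DLLsToRegister", 0, value_type,
                          without_dll(entries, dll_name))
```

The filtering is kept in its own function so it can be checked on a sample list before touching the registry. Step 6 (restarting the Search Service and running a Full Update) is still done by hand.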

An excellent tool to get an overview of installed IFilters is Citeknet IFilter Explorer, which also shows the threading model of each IFilter.

Conclusion

Using the above procedure for either WSS 3.0 or MOSS 2007, you can have your ZIP archives indexed by SharePoint Search. The IFilter recursively indexes ZIP archives nested inside other ZIP archives. Any other files (.txt, .doc, .ppt, .pdf) are indexed as well, and if an IFilter for that file type exists it is used to extract information from the file. This way text inside PDF documents inside a ZIP archive can be indexed.
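
To make "recursively" concrete, here is a small Python sketch of the idea using the standard zipfile module. It is not how any of the IFilters above is implemented, only an illustration of walking an archive and descending into nested archives. The function names walk_archive and make_zip and the sample file names are made up for this example.

```python
import io
import zipfile


def walk_archive(data, prefix=""):
    """Yield (path, content) for every file in a ZIP, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            content = archive.read(info)
            if info.filename.lower().endswith(".zip"):
                # A nested archive: recurse instead of treating it as a document.
                yield from walk_archive(content, path + "/")
            else:
                yield path, content


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (sample data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"notes.txt": b"inner text"})
outer = make_zip({"readme.txt": b"outer text", "nested.zip": inner})
found = dict(walk_archive(outer))
```

In a real indexer each yielded file would then be handed to the filter registered for its extension. Here the content is simply collected into a dictionary keyed by the path inside the archive.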

Note that the search results will show confusing file names (starting with "flt", as a commenter points out below).

 



Comments

Monday, 1 Dec 2008 09:04 by confusing file names
Hi, Why do you get the "confusing file names" that start with flt ? I have the same issue with word and other documents on a new farm with the Infrastructure update. I customised the search results to display the filename rather than the title (Required for a number of reasons). I used the following article as guidance https://forums.microsoft.com/TechNet/ShowPost.aspx?PageIndex=0&SiteID=17&PageID=0&PostID=2408720 . The other smaller farm running without the update is fine with the same content. Thanks

Thursday, 18 Dec 2008 04:45 by Terence
Hi, I would like to ask: do these IFilters support building an index on a nested ZIP file? Thanks.

Friday, 19 Dec 2008 02:48 by Steven Van de Craen
Terence, it does work recursively on nested ZIP files.

Thursday, 8 Jan 2009 01:27 by Samantha Johnson
You say that "The ZIP IFilter will index all files in the archive using a corresponding IFilter". I have installed the Microsoft Filter Pack and modified the registry as they instruct. When I search for a string that I know exists inside a .doc file in a .zip file the search results return nothing. Any idea on why this would be happening?

Friday, 9 Jan 2009 08:33 by Steven Van de Craen
Does it work for plain text files in the zip? Perhaps the registration went wrong or the file wasn't (re)crawled ?

Monday, 2 Feb 2009 10:52 by AB
Will the zip ifilter work on chm (Compiled HTML) files? I've seen third party chm filters but seems 2003 indexed them by default.

Friday, 6 Feb 2009 01:42 by Tory Douglas
Steven. I was wondering if there is any other special configuration for the zip ifilter. I installed on my index server and wasn't able to get it to search content inside the zip file. I also then tried microsoft filter pack and made registry changes associated with that zip filter and still no success. I am pulling my hair out. Thanks

Sunday, 8 Feb 2009 09:32 by Steven Van de Craen
@AB, is CHM based on the ZIP format? If so, it might work, and you'd then need to add CHM to the Crawled File Extensions list. I can't open a CHM file with WinRAR, so my first guess is no, but I could be way off and haven't tested it.

Sunday, 8 Feb 2009 09:34 by Steven Van de Craen
@Tory, do you have the issue with both existing and new ZIP files ? (new being files you added after the iFilter registration) What setup do you have ? MOSS ? WSS ? Check the crawl log for a sample ZIP file (you can filter by location) and see what error it gives.

Tuesday, 7 Apr 2009 02:03 by david latham
all, i'm having the same issue. i'm running WSS, installed the filter pack, modified the registry, etc. i crawl zip archive and log says success. search for a term in text file in zip: notta, nothing. only the filename is indexed. help!

Friday, 29 May 2009 03:17 by William Langenhuizen
Same here, only the filename of the doc file inside the zip file is indexed. Does anyone know how to fix this? I am using MOSS 2007 SP2 with the Microsoft Filter Pack.

Wednesday, 8 Jul 2009 03:57 by Mubeen
Same issue MOSS with MSFT Filter Pack. Zips and Office 2007 files not getting returned in results

Thursday, 7 Jan 2010 10:09 by Jerome
Encountered the same issue also with the MS Filter pack. Only the file name in the zip file is indexed, not the content.

Saturday, 8 May 2010 01:41 by Jeff Hall
I have been successful with this configuration on a 32-bit SharePoint Farm. However, I have been unable to get this to work for the Adobe 9.0 IFilter (64-bit). Do you have any insight on that?

Wednesday, 12 May 2010 11:55 by Steven Van de Craen
Check with Citeknet's iFilter Explorer if the iFilter is installed correctly and mapped to PDF for SharePoint Search. You can also find the correct GUID there to put in the registry. Might be different for the 9.0 iFilter, not sure.

Monday, 17 May 2010 01:56 by Bianco Veigel
The Microsoft ZIP IFilter is a little buggy. It doesn't report the correct number of bytes returned from IFilter::GetText(), so there will be a null byte after the file name. Since strings are normally null-terminated, the indexing will stop after the file name. Here you'll find a working ZIP IFilter: http://gallery.live.com/liveItemDetail.aspx?li=722314ea-aae4-4c56-9132-28b6ab8e144f

Friday, 16 Jul 2010 07:20 by Dorian Grech
Lets say you have multiple site collections but you only want to index zip files in one of them. Is there a way to just index zip files on that site collection?

Monday, 19 Jul 2010 08:10 by Steven Van de Craen
Dorian, You might try a Crawl Rule with wildcard to include zip (eg http://sitecollection/*.zip) and exclude almost everything else ?

Thursday, 9 Sep 2010 07:02 by pryank rohilla
Hi, I am using the MS IFilter for zip files on MOSS 2007 (SP2) search. The search crawler is able to crawl the zip files. But if a zip file contains an ".xls" file, search crawls the zip file but does not return any results for content inside the .xls file. I don't see any issue in the crawl log. For content in .pdf, .xlsx and .doc files inside a zip file, search is able to crawl and show search results.

Wednesday, 27 Oct 2010 09:10 by Joe paul
@ pryank can u please try the below url http://support.microsoft.com/?id=946336
