Searcharoo "2007" (Medium Trust and Office 2007 indexing)
Download source code - 286 Kb
Background
This article follows on from the previous four Searcharoo samples:
Searcharoo 1 describes building a simple search engine that crawls the file system. A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.
Searcharoo 2 focused on adding a 'spider' to find data to index by following web links (downloading files via HTTP and parsing the HTML). It also discusses how the results for multiple search words are combined into a single set of 'matches'.
Searcharoo 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across IIS application restarts without having to be regenerated each time. It also spidered FRAMESETs and added Stop words, Go words and Stemming to the indexer.
Searcharoo 4 added IFilter support for non-text file types (e.g. Word, PDF and PowerPoint), better robots.txt support, a remote-indexing console application and a lot of code tidy-up (refactoring!).
Introduction to version 5
This article is shorter than most, covering just two topics: running under Medium Trust and indexing Office 2007 files.
ASP.NET has 'Trust Issues'
When Searcharoo v4 is run under Medium Trust, you get one of these errors:
WebPermission denied if
[SecurityException: Request for the permission of type 'System.Net.WebPermission, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed.]
   System.Security.CodeAccessSecurityEngine.Check(Object demand, StackCrawlMark& stackMark, Boolean isPermSet) +0
   System.Security.CodeAccessPermission.Demand() +59
   System.Net.HttpWebRequest..ctor(Uri uri, ServicePoint servicePoint) +166
   System.Net.HttpRequestCreator.Create(Uri Uri) +26
   System.Net.WebRequest.Create(Uri requestUri, Boolean useUriBase) +373
   System.Net.WebRequest.Create(String requestUriString) +81
   Searcharoo.Indexer.RobotsTxt..ctor(Uri startPageUri, String userAgent) +250
   Searcharoo.Indexer.Spider.BuildCatalog(Uri startPageUri) +116
SecurityPermission denied if
[SecurityException: Request for the permission of type 'System.Security.Permissions.SecurityPermission, mscorlib, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed.]
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.CheckSecurity(ParseRecord pr) +1644388
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.ParseObject(ParseRecord pr) +363
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.Parse(ParseRecord pr) +64
   System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryObjectWithMapTyped record) +1050
   System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryHeaderEnum binaryHeaderEnum) +62
   System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run() +144
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) +183
   System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) +190
   System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream) +12
   Searcharoo.Common.Catalog.Load() +461
Together, these errors mean Searcharoo v4 can neither create a new catalog nor load an existing catalog file (even one generated elsewhere), so it doesn't work under Medium Trust. There are two options for fixing this problem:
We'll do #2, since it's easier! There was a long discussion in v4 about why Binary Serialization was a good idea and Xml Serialization was bad: in this article we'll turn that around by fixing the problems with the Xml output, so that the catalog can be built remotely using the Indexer Console Application and then uploaded to a Medium Trust website. Xml-serialized data can be de-serialized even under Medium Trust, so it can be loaded and searched.
About Option #2: Xml redux
Original (v4) Xml Catalog format
Recall that each
The problem with the resulting Xml is that the
What's needed is a more succinct way to represent the relationship between
New (v5) Xml Format
Now the
Behind the Xml-serialization Scenes
So that's the Xml format we need - how do we get it?
Unfortunately, just replacing the
The two property declarations look like this (below): the
The
If you check the
Catalog c1 = Kelvin<Catalog>.FromXmlFile(xmlFileName);
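For readers who have not seen a generic Xml file helper before, the following is a minimal sketch of the idea behind a call like FromXmlFile. It is not the Kelvin<T> class shipped with Searcharoo: the class name XmlFileHelper<T> and both method bodies are assumptions written for illustration, built only on the framework's XmlSerializer.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical helper (not the article's Kelvin<T>): wraps XmlSerializer
// so any Xml-serializable type can be saved to, or loaded from, a file.
public static class XmlFileHelper<T>
{
    // Serializes the instance to the given path, overwriting any existing file.
    public static void ToXmlFile(T instance, string path)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (FileStream stream = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            serializer.Serialize(stream, instance);
        }
    }

    // Reads the file at the given path and deserializes it back into a T.
    public static T FromXmlFile(string path)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            return (T)serializer.Deserialize(stream);
        }
    }
}
```

A round trip would then look like XmlFileHelper<MyType>.ToXmlFile(value, path) followed by XmlFileHelper<MyType>.FromXmlFile(path), where MyType is any public type with a parameterless constructor.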
One final note: rather than remove the Binary Serialization feature, both methods
are still available, controlled by a new setting:
<appSettings>
  <add key="Searcharoo_InMediumTrust" value="True" />
</appSettings>
If set to True, the Catalog will be saved as an Xml file; if set to False
it will be written as *.dat. Don't forget to update the other .config file settings to match
your environment, including the Searcharoo_VirtualRoot, Searcharoo_CatalogFilepath
and Searcharoo_TempFilepath, which will be used in the DownloadDocument
class discussed in the remainder of this article...
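As a rough illustration of how such a flag might drive the choice of catalog format, here is a small sketch. The class and method names are hypothetical, and the ".xml" extension is an assumption (the text above only states that the non-Medium-Trust file is *.dat); the setting value is passed in as a string rather than read from the .config file so the example stays self-contained.

```csharp
using System;

// Hypothetical sketch: maps the Searcharoo_InMediumTrust setting value
// to a catalog file name. Names here are illustrative, not Searcharoo's own.
public static class CatalogFormatSketch
{
    // Treats anything other than a parseable "true" as False,
    // so a missing or misspelled setting falls back to the binary format.
    public static bool IsMediumTrust(string settingValue)
    {
        bool parsed;
        return bool.TryParse(settingValue, out parsed) && parsed;
    }

    // Appends ".xml" (assumed extension) under Medium Trust, otherwise ".dat".
    public static string CatalogFileName(string basePath, string settingValue)
    {
        return basePath + (IsMediumTrust(settingValue) ? ".xml" : ".dat");
    }
}
```

Defaulting to the binary format when the setting is absent is a design choice of this sketch only; the real code may behave differently.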
More on Trust & Code Access Security
Office 2007 File Formats
The rest of the article discusses indexing the new Office 2007 file formats.
Microsoft Word Docx file 'structure'
A Microsoft Word 2007 file looks like this 'inside' the ZIP:
Step 1: Subclassing Document to share download code
The v4 article
describes how the
Step 2: unZIP
The
Using the
using (ZipFile zip = ZipFile.Read(filename))
{
    using (MemoryStream streamroot = new MemoryStream())
    {
        // Extract the main document part into the in-memory stream
        zip.Extract(@"word/document.xml", streamroot);
        streamroot.Seek(0, SeekOrigin.Begin); // important! rewind before reading
        // TODO: code here to process Xml from the stream
    }
}
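For comparison, the same unZIP step can be sketched with the framework's own System.IO.Compression.ZipArchive class. This is a swapped-in alternative, not the library used in the article: ZipArchive only exists in .NET Framework 4.5 and later, and the class and method names below (ZipEntrySketch, ReadEntry) are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical alternative to the article's ZIP library, using ZipArchive.
public static class ZipEntrySketch
{
    // Copies one named entry out of a ZIP stream into a rewound MemoryStream.
    public static MemoryStream ReadEntry(Stream zipStream, string entryName)
    {
        // leaveOpen: true, so the caller stays responsible for zipStream
        using (ZipArchive archive = new ZipArchive(zipStream, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null)
            {
                throw new FileNotFoundException("Entry not found in archive", entryName);
            }

            MemoryStream buffer = new MemoryStream();
            using (Stream entryStream = entry.Open())
            {
                entryStream.CopyTo(buffer);
            }
            buffer.Seek(0, SeekOrigin.Begin); // rewind so the caller can read from the start
            return buffer;
        }
    }
}
```

For a Word 2007 file, the entry name would be "word/document.xml", matching the path used above.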
Step 3: Extract text
Turns out the Word 2007 OpenXML format is very Html-like in its treatment of
formatting and content: all document structure and formatting present in
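One way to pull the words out of such a document is simply to walk the Xml and keep only the text nodes. The sketch below assumes that approach; it is not necessarily the code used in Searcharoo, and the names DocxTextSketch and ExtractText are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical sketch: collects the text content of an Xml stream
// (for example word/document.xml) and ignores all elements and attributes.
public static class DocxTextSketch
{
    // Returns every text node in document order, separated by single spaces.
    public static string ExtractText(Stream xmlStream)
    {
        StringBuilder text = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(xmlStream))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value).Append(' ');
                }
            }
        }
        return text.ToString().Trim();
    }
}
```

Inserting a space after each text node is a simplification: a single word split across several runs would be indexed as separate fragments, which a real indexer may want to handle differently.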
DocxDocument in 3 easy steps
The new Docx file indexer inherits most of its functionality from the abstract
This same pattern can be easily applied to PowerPoint 2007 (
Lastly, our new classes will never be instantiated unless we update
More on Open XML Office Formats
Erika's blog is an excellent source of Office 2003 and 2007: MSDN Technical Articles, How-To Content, and Code Samples. Other links include:
Wrap-up
These additions to Searcharoo are quite minor, and have been posted mainly to help anyone
wishing to use the code under Medium Trust. Many users may have Office 2007 installed
(or the relevant