Image/Location Index/Search(aroo)
Download source code - 317 Kb
Background
This article follows on from the previous five Searcharoo samples:
Searcharoo 1 was a simple search engine that crawled the file system. Very rough.
Searcharoo 2 added a 'spider' to index web links and then search for multiple words.
Searcharoo 3 saved the catalog to reload as required; spidered FRAMESETs and added Stop words, Go words and Stemming.
Searcharoo 4 added non-text filetypes (e.g. Word, PDF and Powerpoint), better robots.txt support and a remote-indexing console app.
Searcharoo 5 runs in Medium Trust and includes some refactoring.
Introduction to version 6
The following additions have been made:
NOTE: This version of Searcharoo only displays the location (Latitude, Longitude) data - it doesn't "search by location". To read how to search "nearby" using location data, see the Store Locator: Help customers find you with Google Maps article.
Image Indexing (reading Jpeg metadata)
In Searcharoo 5 the
The simple changes to the object model are shown above - the actual
The two key parts of the code are shown below - the first uses System.Drawing.Imaging and EXIFextractor.
The EXIF data is stored in a 'binary' format - opening a tagged image in Notepad2 shows recognisable data with binary markers:
The binary structures are recognised by the .NET
using System.Collections; // DictionaryEntry
using System.Drawing.Imaging; // PropertyItem
using System.IO; // FileStream
// ...
public static PropertyItem[] GetExifProperties(string fileName)
{
    // Open the file read-only; both the stream and the image are disposed on exit
    using (FileStream stream =
        new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        // Skip image data validation: only the metadata is needed here
        using (System.Drawing.Image image =
            System.Drawing.Image.FromStream(stream,
                /* useEmbeddedColorManagement = */ true,
                /* validateImageData = */ false))
        {
            return image.PropertyItems;
        }
    }
}
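Since this version displays latitude and longitude, it may help to see how GPS values in the EXIF data can be turned into decimal degrees. The sketch below is not the EXIFextractor code. It assumes the standard EXIF GPS tags (0x0001 to 0x0004), where latitude and longitude are each stored as three unsigned rationals (degrees, minutes, seconds) and the matching reference tag holds N/S or E/W. The ExifGps class and its member names are hypothetical, and the byte order of the value is assumed to match the machine, which is what BitConverter expects.

```csharp
using System;

public static class ExifGps
{
    // EXIF GPS tag ids (hypothetical constant names, standard id values)
    public const int GpsLatitudeRef = 0x0001;   // "N" or "S"
    public const int GpsLatitude = 0x0002;      // three rationals
    public const int GpsLongitudeRef = 0x0003;  // "E" or "W"
    public const int GpsLongitude = 0x0004;     // three rationals

    // Converts a 24-byte degrees/minutes/seconds value to signed decimal degrees.
    public static double ToDecimalDegrees(byte[] value, string reference)
    {
        if (value == null || value.Length < 24)
            throw new ArgumentException("Expected three 8-byte rationals.", "value");

        double degrees = ReadRational(value, 0);
        double minutes = ReadRational(value, 8);
        double seconds = ReadRational(value, 16);
        double result = degrees + minutes / 60.0 + seconds / 3600.0;

        // South and West are negative; the reference may carry a trailing null
        bool negative = reference != null &&
            (reference.StartsWith("S", StringComparison.OrdinalIgnoreCase) ||
             reference.StartsWith("W", StringComparison.OrdinalIgnoreCase));
        return negative ? -result : result;
    }

    // A rational is two 32-bit unsigned integers: numerator, then denominator.
    private static double ReadRational(byte[] value, int offset)
    {
        uint numerator = BitConverter.ToUInt32(value, offset);
        uint denominator = BitConverter.ToUInt32(value, offset + 4);
        return denominator == 0 ? 0.0 : (double)numerator / denominator;
    }
}
```

With the GetExifProperties method above, the value argument would be the Value of the PropertyItem whose Id is 0x0002 or 0x0004, and the reference would be the ASCII text of the item whose Id is 0x0001 or 0x0003.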
Searcharoo uses the EXIFextractor code so that it's easy to review and change - there is also an alternative method to access the EXIF data by incorporating the Exiv2.dll library into your code, but I'll leave that up to you.
XMP via XML island
Unlike the EXIF data, XMP is basically 'human readable' within the JPG file as you can see below - there is an 'island' of pure XML in the binary image data. Importantly, XMP is the only way to get the Title and Description information which is really useful for searching.
public static string GetXmpXmlDocFromImage(string filename)
{
    string contents;
    using (System.IO.StreamReader sr = new System.IO.StreamReader(filename))
    {
        contents = sr.ReadToEnd(); // the using block closes the reader
    }
    int beginPos = contents.IndexOf("<rdf:RDF", 0);
    int endPos = contents.IndexOf("</rdf:RDF>", 0);
    // ... then get title
    xmlNode = doc.SelectSingleNode(
        "/rdf:RDF/rdf:Description/dc:title/rdf:Alt", NamespaceManager);
    // ... then get description
    xmlNode = doc.SelectSingleNode(
        "/rdf:RDF/rdf:Description/dc:description/rdf:Alt", NamespaceManager);
    // ... etc.
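The excerpt above skips the steps between finding the island and querying it. As a minimal sketch of those steps (not the Searcharoo source), the code below cuts out the rdf:RDF element, loads it into an XmlDocument and reads the title with the same XPath. The XmpSketch class is hypothetical, and it assumes the rdf and dc prefixes are declared inside the rdf:RDF element, which is common but not guaranteed; if they are not, LoadXml will throw an XmlException.

```csharp
using System;
using System.Xml;

public static class XmpSketch
{
    // Returns the <rdf:RDF>...</rdf:RDF> island, or null if none is found.
    public static string ExtractRdf(string contents)
    {
        const string endTag = "</rdf:RDF>";
        int beginPos = contents.IndexOf("<rdf:RDF", StringComparison.Ordinal);
        if (beginPos < 0) return null;
        int endPos = contents.IndexOf(endTag, beginPos, StringComparison.Ordinal);
        if (endPos < 0) return null;
        // Include the closing tag so the fragment is well-formed
        return contents.Substring(beginPos, endPos - beginPos + endTag.Length);
    }

    // Reads dc:title from the island using the XPath shown in the article.
    public static string GetTitle(string rdfXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(rdfXml);

        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        ns.AddNamespace("dc", "http://purl.org/dc/elements/1.1/");

        XmlNode node = doc.SelectSingleNode(
            "/rdf:RDF/rdf:Description/dc:title/rdf:Alt", ns);
        return node == null ? null : node.InnerText.Trim();
    }
}
```

The description can be read the same way by swapping dc:title for dc:description in the XPath.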
It's worthwhile noting that if you are working with .NET 3.0 or later, extracting XMP metadata is much more 'scientific' than the above example - using WIC (Windows Imaging Component) to access photo metadata. In order to keep this version of Searcharoo compatible with .NET 2.0, utilizing those newer features has been avoided (for now).
If you want to be able to add metadata to your images, try iTag: photo tagging software (recommended). iTag was used to tag many of the photos used during testing.
Indexing Geographic Location (Latitude/Longitude)
Image metadata & Html Meta tags
<meta name="ICBM" content="50.167958, -97.133185">
<meta name="geo.position" content="50.167958;-97.133185">
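The two tags differ only in the separator: ICBM uses a comma and geo.position uses a semicolon. A rough sketch of turning either content value into a latitude/longitude pair follows; the GeoMeta class is hypothetical and is not the parsing code used by Searcharoo. It parses with the invariant culture so that the decimal point is read the same way on every server.

```csharp
using System;
using System.Globalization;

public static class GeoMeta
{
    // Parses "lat, long" (ICBM) or "lat;long" (geo.position) content values.
    public static bool TryParse(string content, out double latitude, out double longitude)
    {
        latitude = 0;
        longitude = 0;
        if (string.IsNullOrEmpty(content)) return false;

        string[] parts = content.Split(new char[] { ',', ';' });
        if (parts.Length != 2) return false;

        bool parsed =
            double.TryParse(parts[0].Trim(), NumberStyles.Float,
                CultureInfo.InvariantCulture, out latitude) &&
            double.TryParse(parts[1].Trim(), NumberStyles.Float,
                CultureInfo.InvariantCulture, out longitude);
        if (!parsed) return false;

        // Reject values that cannot be coordinates
        return latitude >= -90 && latitude <= 90 &&
               longitude >= -180 && longitude <= 180;
    }
}
```

Both sample tags above would produce latitude 50.167958 and longitude -97.133185.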
The code currently parses out description, keyword and robot META tags using the following Regular Expression/
Adding support for ICBM and geo.position merely required a couple of additional
Once we have the longitude and latitude for an Html or Jpeg, it is set in the base
Additional metadata: File type and Keywords (tags)
Since we needed to add properties to the base
Only Html and Jpeg classes currently support Keyword parsing, but almost all the subclasses correctly set the Extension property that indicates file-type. These additional pieces of information will provide more feedback to the user when viewing results, and in future may be used for: (a) alternate search result navigation (e.g. a tag cloud) and/or (b) changes to the ranking algorithm when a keyword is 'matched'.
"No" trust Catalog access
This 'problem' continues on from the "Medium" Trust issue discussed in Searcharoo 5 so it might be worthwhile reading that article again.
Basically, the new problem is that NOT EVEN WebClient permission is allowed, so the
The steps required are:
System.Reflection.Assembly a =
System.Reflection.Assembly.Load("WebAppCatalogResource");
string[] resNames = a.GetManifestResourceNames();
Catalog c2 = Kelvin<Catalog>.FromResource(a, resNames[0]);
One final note: rather than remove the Binary or Xml Serialization features that run in "Full" and "Medium" Trust, all methods are still available. Whether Binary or Xml is used is controlled by the Searcharoo_InMediumTrust setting:
<appSettings>
    <add key="Searcharoo_InMediumTrust" value="True" />
</appSettings>
If set to True, the Catalog will be saved as a *.XML file; if set to False it will be written as *.DAT. Only if the code cannot load EITHER of these files will the resource DLL be used (an easy way to force it would be to delete all .DAT and .XML catalog files).
Presenting the newly indexed data
The aforementioned changes mainly focus on the addition of indexing functionality: finding new data
(latitude, longitude, keyword, file-type), cataloging it and allowing the Catalog to be accessed. Presenting
the search results with this additional data required some changes to the
The new properties added to
SearchKml.aspx
The presence of location data (latitude and longitude coordinates) doesn't just allow us to 'link' out to
a single location - it allows for a whole new way to view the search results! When one or more result links are
found to have location data, a new view in Google Earth link is displayed. It links to the
However, the link to SearchKml.aspx is formatted more like a 'file reference': for example searcharoo.net/SearchKml/newyork.kml takes you to the screenshots below:
How (and why) does the link searcharoo.net/SearchKml/newyork.kml use the
The reason for the url format (using a .kml extension and embedding the search term) is to enable browsers to open Google Earth based on the file extension (when Google Earth is installed, it registers .KML and .KMZ as known file types). Because the link "looks" like it refers to a newyork.kml file (and not an ASPX page), most browsers/operating systems will automatically open it in Google Earth (or another program registered for that file type). The link syntax is accomplished with a 404 custom error handler (which must also be set up in either web.config or the IIS Custom Errors tab):
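A minimal sketch of how such a handler might recover the search term from the missing "file" is shown below. It is an illustration only, not the code in Searcharoo's 404.aspx: the KmlLink class is hypothetical, and it assumes the handler has access to the originally requested address. With an ASP.NET customErrors redirect that address usually arrives in the aspxerrorpath querystring parameter, while an IIS custom error URL typically passes it as "404;" followed by the full url.

```csharp
using System;

public static class KmlLink
{
    // Returns the search term embedded in ".../SearchKml/<term>.kml", or null.
    public static string GetSearchTerm(string requested)
    {
        const string marker = "/SearchKml/";
        const string extension = ".kml";
        if (string.IsNullOrEmpty(requested)) return null;

        int start = requested.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        // Everything after the marker should be "<term>.kml"
        string fileName = requested.Substring(start + marker.Length);
        if (!fileName.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
            return null;

        string term = fileName.Substring(0, fileName.Length - extension.Length);
        if (term.Length == 0) return null;

        // Decode escapes such as %20 so the term can be passed to the search
        return Uri.UnescapeDataString(term);
    }
}
```

When a term is found, the handler can hand it to the page that generates the KML; when it is not, the request is a genuine 404 and the normal error content can be shown.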
NOTE: the same 'behaviour' is possible using a custom HttpHandler, however it requires "mapping" the .KML extension to the .NET Framework in IIS - something that isn't always possible on cheaper hosting providers [which will usually still allow you to set up a custom 404 URL]. It would be even easier using the new URL Routing Framework being introduced in .NET 3.5 SP1. For now, Searcharoo uses the simplest approach - 404.aspx.
Minor enhancements
Color-coded Indexer.EXE
This is purely a cosmetic change to make using the Searcharoo.Indexer.EXE easier; and uses
Philip Fitzsimons' article
Putting colour/color to work on the console.
Each different "logging level" (or 'verbosity') is output in a different color to make reading easier.
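As a rough illustration of the idea, the sketch below maps a verbosity level to a console color using the Console.ForegroundColor property that is built into .NET 2.0. This is a swapped-in technique and not necessarily the approach taken in the referenced article or in Searcharoo.Indexer.EXE; the ConsoleLog class, the level numbers and the color choices are all assumptions.

```csharp
using System;

public static class ConsoleLog
{
    // Hypothetical mapping: lower numbers are more important messages.
    public static ConsoleColor ColorFor(int verbosity)
    {
        switch (verbosity)
        {
            case 1: return ConsoleColor.Red;     // errors
            case 2: return ConsoleColor.Yellow;  // warnings
            case 3: return ConsoleColor.White;   // progress
            default: return ConsoleColor.Gray;   // detailed output
        }
    }

    // Writes one line in the color for its level, then restores the old color.
    public static void Write(int verbosity, string message)
    {
        ConsoleColor previous = Console.ForegroundColor;
        Console.ForegroundColor = ColorFor(verbosity);
        try
        {
            Console.WriteLine(message);
        }
        finally
        {
            Console.ForegroundColor = previous;
        }
    }
}
```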
Multiple start Urls
Previous versions of Searcharoo only allowed a single 'start Url', and any links away from that Url were ignored.
Version 6 now allows multiple 'start Urls' to be specified - they will all be indexed and added to the same
WARNING: indexing takes time and uses network bandwidth - DON'T index lots of sites without being aware of how long it will take. If you stop indexing half-way-through, the
Recognising 'fully qualified' local links
A 'bug' that was reported by a few people (without resolution) has been addressed - if your site has "fully qualified" links
(eg. my blog conceptdev.blogspot.com
has ALL the anchor tags specified like this "http://conceptdev.blogspot.com/2007/08/latlong-to-pixel-conversion-for.html")
these links were marked as "External" and not crawled (
NOTE: the code does a 'starts with' comparison, so if you specified a subdirectory for your "Start Url" (e.g. http://searcharoo.net/SearcharooV1/) then a fully qualified link to a different subdirectory will STILL not be indexed (e.g. <a href="http://searcharoo.net/SearcharooV2/SearcharooSpider_alpha.html"> would NOT be followed).
Bug fix (honor roll)
Many thanks to the following CodeProject readers/contributors:
<TITLE> tag parsing
Erick Brown [work] identified the problem of CRLFs in the <TITLE> tag causing it to not be indexed... and provided a new Regex to fix it.
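To illustrate why line breaks matter here: by default the '.' in a .NET regular expression does not match a newline, so a pattern written for a single-line title finds nothing when the tag spans several lines. The sketch below shows one way to tolerate that using RegexOptions.Singleline. It is not the Regex that was contributed; the TitleParser class and the pattern are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleParser
{
    // Singleline lets '.' match CR and LF, so multi-line titles are captured.
    private static readonly Regex TitlePattern = new Regex(
        @"<title[^>]*>(?<title>.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the title with runs of whitespace collapsed, or an empty string.
    public static string GetTitle(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        Match match = TitlePattern.Match(html);
        if (!match.Success) return string.Empty;

        // Replace CRLFs, tabs and repeated spaces with a single space
        return Regex.Replace(match.Groups["title"].Value, @"\s+", " ").Trim();
    }
}
```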
Correctly identifying external site links
mike-j-g (and later hitman17) correctly pointed out that the matching of links (in
Handling escaped &amp; ampersands in the querystring
Thanks to Erick Brown [work] again for highlighting a problem (and providing a fix) for badly handled &amp; ampersands in querystrings.
Proxy support
stephenlane80 provided code for downloading via a proxy server (his change was added to
Parsing robots.txt from Unix servers
maaguirr suggested a change to the
Try it out
In order to 'try out' Searcharoo without having to download and set up the code, there is now a set of test files on searcharoo.net which you can search here. The test files are an assortment of purpose-written files (e.g. to test Frames, IFrames, META tags, etc.) plus some geotagged photos from various holidays.
Wrap-up
Obviously the biggest change in this version is the ability to 'index' images using the metadata available in the JPG format. Other metadata (keywords, filetype) has also been added, and a foundation created to index and store even more information if you wish.