|
Version 6
|
|
Index JPG images and GPS location data for mapping results, address the "No" Trust problem, and fix a few bugs.
NEW! June '08
|
Version 5
|
|
Remove binary serialization to solve the Medium Trust problem; index OpenXML document formats.
|
Version 4
|
|
Refactored codebase, plus the ability to index and search Microsoft Word,
Excel, PowerPoint and Acrobat PDF files. Small improvements such as robots.txt
support and excluding regions of HTML were also added.
|
Version 3
|
|
Adds "save to disk" for the catalog, and incorporates feature suggestions,
bug fixes and code contributed by others
for previous versions.
|
Version 2
|
|
Extends Searcharoo to populate its search
catalog by spidering HTML pages: it follows links and image maps
to process both static and dynamically generated pages!
You can also search for multiple words.
|
Version 1
|
|
How to build a simple, extensible search engine using ASP.NET that
can crawl files and create a searchable catalog by processing the
text from HTML source.
|
Web search technology is a huge subject, encompassing:
- networking (spidering the web),
- string and markup-language manipulation (parsing HTML),
- proprietary file formats (searching Word, Excel, PDF, etc.),
- language and text parsing (finding words and sentences in documents, stemming and other linguistic analysis),
- algorithms (finding matches, AND/OR queries, combining multiple-word results),
- performance (both increasing spidering speed and making large catalogs fast to search),
- user interface (presenting search input options and results).
Searcharoo.net barely scratches the surface of
any of these topics :-) but it does attempt to introduce them with an open-source
C# implementation of a search engine that you can download and use on your website.
The default interface should be familiar (and is easily customizable).
The articles describe how the engine itself is built, from a simple file-system crawler to
a fully fledged web spider. You can comment or ask questions on CodeProject.
In addition to the information on this website, these search-related links
might be interesting or useful.
|