Technical document

This is a short introduction to some of the technical aspects of IntraSeek. If you are not interested in the inner workings of IntraSeek, you can comfortably skip this chapter.

Storage of databases
IntraSeek currently uses Yabu as the database handler. Note that no logic is placed in the database: all Boolean search operations, tree structure calculations and wildcard matching are done by IntraSeek itself. The database is used only in a basic way, to store data and to retrieve it.
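
As an illustration of this division of labour, a Boolean AND search can be evaluated entirely in application code against a store that only maps words to the pages containing them. The following Python sketch is purely illustrative (IntraSeek itself is written in Pike); the dictionary stands in for the Yabu word index.

    # Hypothetical in-memory stand-in for the word index: the store
    # only maps a word to the set of page IDs that contain it.
    index = {
        "search": {1, 2, 5},
        "engine": {2, 5, 7},
        "crawler": {3, 5},
    }

    def boolean_and(words):
        """Intersect posting sets in application code; the store itself
        performs no query logic, it only returns stored values."""
        postings = [index.get(w, set()) for w in words]
        if not postings:
            return set()
        result = postings[0]
        for p in postings[1:]:
            result &= p
        return result

    print(boolean_and(["search", "engine"]))  # {2, 5}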

When collecting new information, the crawler uses a separate database, so that web users can keep searching while the crawler collects new information. When the crawl is finished, a flag file is written to tell the search engine to swap in the new database, overwriting the old one.
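
The hand-over can be pictured roughly as follows. This is an illustrative Python sketch, not IntraSeek's actual code; the file names index.db, index.db.new and swap.flag are invented for the example.

    import os

    DB = "index.db"          # database currently served to searchers
    NEW_DB = "index.db.new"  # database the crawler writes to
    FLAG = "swap.flag"       # created when NEW_DB is complete

    def crawler_finished():
        """Crawler side: signal that the new database is ready."""
        open(FLAG, "w").close()

    def maybe_swap():
        """Search-engine side: check the flag and swap the new
        database in, overwriting the old one."""
        if os.path.exists(FLAG):
            os.replace(NEW_DB, DB)  # atomic rename on POSIX systems
            os.remove(FLAG)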

Memory usage
The memory used by Pike and IntraSeek varies a great deal, depending on your operating system.

However, the more pages IntraSeek collects, the more memory it uses, since it keeps the site structure, its errors and two stacks in memory: the pages still to visit and the pages already visited.
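
Conceptually, the crawl loop maintains those two collections like this; again an illustrative Python sketch, where fetch_and_extract_links is a hypothetical stand-in for the real download-and-parse step.

    def crawl(start_url, fetch_and_extract_links):
        """Minimal crawl loop: one stack of pages still to visit,
        one set of pages already visited."""
        to_visit = [start_url]
        visited = set()
        while to_visit:
            url = to_visit.pop()
            if url in visited:
                continue
            visited.add(url)
            for link in fetch_and_extract_links(url):
                if link not in visited:
                    to_visit.append(link)
        return visited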

The index of words is also kept in memory, but is written to disc at certain intervals, called safety saves. A safety save dumps the index to a Yabu database, then clears the memory it used. At the same time, the disc databases are reorganized to keep their size down. While a reorganization is running, you will notice files whose names start and end with "#" in the temporary storage directory.

By default, these saves occur every 500 pages. You can lower this value if you run into memory problems. You should not increase it, however; if you do, the structure database will grow faster and faster, and the crawler will consume more and more memory.
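
The interval works roughly like a counter in the indexing loop, as in this illustrative Python sketch; dump_to_database is a hypothetical stand-in for the Yabu dump, and 500 matches the default.

    SAFETY_SAVE_INTERVAL = 500  # default: dump the index every 500 pages

    class Indexer:
        def __init__(self, dump_to_database):
            self.dump = dump_to_database  # hypothetical dump callback
            self.index = {}               # in-memory word index
            self.pages_since_save = 0

        def add_page(self, url, words):
            for w in words:
                self.index.setdefault(w, set()).add(url)
            self.pages_since_save += 1
            if self.pages_since_save >= SAFETY_SAVE_INTERVAL:
                self.dump(self.index)   # write the index to disc ...
                self.index.clear()      # ... then free the memory it used
                self.pages_since_save = 0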

If you are not interested in statistics for the site, such as the log of broken links, you can disable this feature to save memory. To do so, go to the profile configuration and set the Site structure logs variable to no.

To limit the size of the index, both in memory and on disc, stop lists are supported. A stop list contains short, "meaningless" words that are filtered out; the English stop list, for example, contains words like "the", "and" and "it". Use one or several stop lists covering the languages used on the pages you run the crawler through.
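
Filtering against a stop list amounts to dropping listed words before they ever reach the index; a minimal illustrative sketch with a tiny sample stop list:

    STOP_WORDS = {"the", "and", "it", "a", "of"}  # sample English stop list

    def filter_stop_words(words):
        """Drop short, 'meaningless' words so they never enter the index."""
        return [w for w in words if w.lower() not in STOP_WORDS]

    print(filter_stop_words(["The", "crawler", "and", "the", "index"]))
    # ['crawler', 'index']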

Max download per document defines, in characters, how much of a document should be downloaded. This is used to limit the index size. Normally, the first 100000 characters of a document are very likely to contain enough terms to cover its content, although this of course varies with the type of information on your site. If all words are important, even those at the end of large documents, increase this value, for instance to 999999.
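
A truncated download simply stops reading after the configured amount, as in this illustrative Python sketch; note that the sketch counts bytes rather than characters, which is close enough for the purpose of the example.

    from urllib.request import urlopen

    def fetch_truncated(url, max_chars=100000):
        """Download at most max_chars bytes of a document; terms
        past the cut-off never reach the index."""
        with urlopen(url) as response:
            data = response.read(max_chars)  # stop after max_chars bytes
        return data.decode("utf-8", errors="replace")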

Log files are stored on disc and also take up some memory. Make sure to delete them from time to time using the configuration interface.

Rejects and accepts
Reject and accept patterns are unique to every profile. By default, the reject pattern contains a number of standard exclusions, mainly rejecting files ending in .gif, .gz and so on. The accept pattern is empty by default.

The reject and accept rules are applied, in the following order, when IntraSeek is about to schedule a new URL for a visit (see the sketch after the list):

  1. First, the URL is matched against the reject patterns. If any of them matches, IntraSeek will not visit the URL, and no further checks are made.

  2. Then the URL is matched against the accept patterns. If any of them matches, the URL is accepted, and no further checks are made.

  3. Finally, if the URL was neither rejected nor accepted, IntraSeek rejects it by default.
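
Put together, the scheduling decision can be sketched as follows. The sketch is illustrative Python and uses ordinary regular expressions for the patterns, which is an assumption made for the example.

    import re

    def should_visit(url, reject_patterns, accept_patterns):
        """Apply the rules in order: reject first, then accept,
        then reject by default."""
        # 1. Any matching reject pattern stops further checks.
        if any(re.search(p, url) for p in reject_patterns):
            return False
        # 2. Any matching accept pattern accepts the URL.
        if any(re.search(p, url) for p in accept_patterns):
            return True
        # 3. Neither matched: reject by default.
        return False

    print(should_visit("http://example.com/page.html",
                       [r"\.gif$", r"\.gz$"],
                       [r"^http://example\.com/"]))
    # True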