
This module scans the specified site and uses the information it gathers to perform the following tasks:

- finds broken/non-working links;
- finds broken/missing images;
- finds "lost" or "orphaned" files;
- finds errors and bugs in HTML code;
- checks the Google PageRank value for every page;
- creates a detailed report on all external links from the site;
- creates site maps formatted as HTML pages;
- creates an XML sitemap in a format that can be submitted to the Google search engine;
- creates and edits the robots.txt file as used by search engines.

Project settings

Site address – the address of the site to be analyzed. Specify the site root (domain) here; the program downloads that page and finds the remaining site pages by following links.

Maximum page size (KB) – the maximum page size that will be downloaded and analyzed; larger pages are ignored. This parameter lets you exclude HTML pages that are too large, and also avoids problems when the server identifies a binary file (such as a zip archive) as an ordinary web page: such files can be very large, which creates excess traffic and wastes time. Specify a size of zero if you wish to ignore page sizes.
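
To illustrate how such a limit can be enforced, here is a minimal Python sketch; the 50 KB limit and the HEAD-request approach are illustrative, not necessarily how Site Analyzer itself works:

    import urllib.request

    MAX_PAGE_SIZE = 50 * 1024  # hypothetical limit of 50 KB; zero would mean "no limit"

    def should_download(url):
        # Ask for the headers first so that large binaries (e.g. zip archives
        # served as ordinary pages) are skipped without transferring their bodies.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            size = response.headers.get("Content-Length")
        if MAX_PAGE_SIZE == 0 or size is None:
            return True  # no limit set, or the server did not report a size
        return int(size) <= MAX_PAGE_SIZE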

Number of search threads – specifies how many pages will be downloaded simultaneously.
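
This setting behaves like the size of a worker pool. A rough Python sketch of the idea, with hypothetical page addresses:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    NUM_THREADS = 4  # plays the role of "Number of search threads"

    def fetch(url):
        with urllib.request.urlopen(url) as response:
            return url, response.read()

    queue = ["http://site.com/", "http://site.com/about.html"]  # hypothetical pages
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        for url, body in pool.map(fetch, queue):
            print(url, len(body), "bytes")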

Encoding – sets the site encoding. In most cases the analyzer can detect the encoding automatically; if this doesn't work, you can specify it manually.
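
Automatic detection usually starts from the charset declared in the Content-Type response header (real detectors also inspect meta tags and the bytes themselves); a minimal sketch of the header-based part:

    import urllib.request

    def detect_encoding(url, manual_fallback="utf-8"):
        # Prefer the charset declared in the Content-Type response header;
        # fall back to a manually chosen encoding when none is declared.
        with urllib.request.urlopen(url) as response:
            declared = response.headers.get_content_charset()
        return declared or manual_fallback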

Check starting page only – restricts the scan to the specified web page. Links to internal site pages are not followed.

Check external links – if this option is selected, the program will check the validity of all external links to other sites as well as links to internal pages on the specified site.

The Analyze and Stop buttons start and stop the site analysis process.

Files – lists all site pages that were found and analyzed during the scan. There is a color-coded indicator next to each page address. Here is what the colors signify:

White – the page is queued ready for downloading,
Yellow – the program has the page marked for downloading and analysis,
Orange – the page is downloaded, but not yet analyzed,
Red – the page was downloaded but an error occurred,
Purple – the page was queued but the user interrupted the analysis,
Green – the page was downloaded and successfully analyzed.

Under this page list you will see the following totals: pages found, analyzed pages, error pages and pages queued ready for downloading.

The Refresh list option enables or disables periodic automatic refreshes of the above page list.

HTML settings – in addition to checking link validity, the program can also find faults in HTML code. Here is a summary of the HTML checks made by Site Analyzer:

TITLE – checks the page for the presence of a TITLE tag.
Meta Description – checks the page for the presence of a Description Meta tag.
Meta Keywords – checks the page for the presence of a Keywords Meta tag.
Image Alt – checks each image on the page for the presence of an alt attribute.

If those options are selected, information on relevant missing tags will be included in the report.
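
As a rough illustration of such checks, here is a small Python sketch using the standard html.parser module; it is a simplified stand-in, not Site Analyzer's actual implementation:

    from html.parser import HTMLParser

    class TagAudit(HTMLParser):
        # Records whether TITLE and the meta tags are present,
        # and counts images that lack an alt attribute.
        def __init__(self):
            super().__init__()
            self.has_title = False
            self.meta_names = set()
            self.images_missing_alt = 0

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self.has_title = True
            elif tag == "meta":
                self.meta_names.add((attrs.get("name") or "").lower())
            elif tag == "img" and not attrs.get("alt"):
                self.images_missing_alt += 1

    audit = TagAudit()
    audit.feed('<html><head><title>Home</title></head>'
               '<body><img src="logo.png"></body></html>')
    print(audit.has_title)                    # True
    print("description" in audit.meta_names)  # False: Meta Description is missing
    print(audit.images_missing_alt)           # 1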

Signature of 404 page – in the "classic" server configuration, the server returns a "404 – page not found" response when a missing page is requested, so the client (usually a web browser) learns that the required page doesn't exist. However, many servers are configured so that no error occurs when a nonexistent page is requested: instead the user is served an alternative custom page, usually containing an apology that the page could not be found (we'll call it the "custom 404 page"). The server response in this case is "200 – OK", because a valid page was served, which makes it impossible to determine automatically that the originally requested page doesn't exist. In this case you should specify an HTML code fragment that occurs only in your custom 404 page, so that the program can properly detect faulty links.
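
A minimal sketch of how a link checker can use such a signature (the signature text is a hypothetical example):

    import urllib.request
    import urllib.error

    # Fragment that occurs only on the custom 404 page (hypothetical text).
    SIGNATURE = "Sorry, the page you requested could not be found"

    def link_is_broken(url):
        try:
            with urllib.request.urlopen(url) as response:
                body = response.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError:
            return True  # "classic" configuration: the server reported an error status
        # The server answered "200 - OK", so check for the custom 404 signature.
        return SIGNATURE in body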

FTP settings and Lost files. Lost files are pages that are present in the site directory but are not linked from any other site page. Site Analyzer can find such files if it has FTP access to the site: the program compares the list of pages found by the HTTP scanner (when the site was scanned) with the list of pages in the site's FTP directory. Any extra HTML files found in the FTP directory are marked as "lost" in the report.
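
A simplified sketch of that comparison, using Python's ftplib and scanning a single directory for brevity (the function name and parameters are hypothetical):

    from ftplib import FTP

    def find_lost_files(crawled_urls, host, user, password, root="/"):
        # List HTML files in the site's FTP directory (one level, for brevity)
        # and subtract the pages the HTTP scanner actually reached.
        ftp = FTP(host)
        ftp.login(user, password)
        ftp.cwd(root)
        on_server = {name for name in ftp.nlst() if name.endswith((".html", ".htm"))}
        ftp.quit()
        reached = {url.rsplit("/", 1)[-1] for url in crawled_urls}
        return on_server - reached  # files that nothing links to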

Find lost files – displays information about lost files in the report. Because this feature requires FTP access to the site, you need to specify the FTP settings – address, user name, password, root directory and FTP port.

Index filename – each site has a so-called index file: the HTML page that is shown when you type the root address of the site (domain name), usually named index.html. This name doesn't normally appear in the site's HTML code (because "site.com" and "site.com/index.html" are equivalent, the shorter form "site.com" is generally used), so the index page would otherwise appear in the lost pages list. Specify the name of the index page here to suppress this.

The Scan FTP and Stop buttons allow scanning of the site FTP directory. This scan is performed automatically during the main site scan if the Find lost files option is selected but can be performed separately using these buttons.

File extensions – specifies the list of file extensions that count as pages when the program looks for lost files.

Include/Exclude paths. Allows you to specify which site sections should or should not be analyzed. This is useful if you need to skip some site sections during analysis. For example, you could scan just the main site web pages, skipping the forum and the guest book.

The "Include paths" mode – in this mode the program analyzes only the pages matching a specified pattern. For example, to check a site section called www.site.com/reports/ you should type /reports. All other site pages will be ignored.

The "Exclude paths" mode – the program will analyze all site pages except the specified ones. For example, to avoid analysis of a site forum you could type the path /forum.php?. In this case the program will analyze all pages except the forum pages.
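
A sketch of how such prefix filters can be applied; both lists are shown together for illustration, whereas in the program one mode is active at a time:

    from urllib.parse import urlparse

    INCLUDE = ["/reports"]      # "Include paths" example from above
    EXCLUDE = ["/forum.php?"]   # "Exclude paths" example from above

    def is_allowed(url):
        # An empty INCLUDE list means "no restriction".
        parts = urlparse(url)
        full = parts.path + ("?" + parts.query if parts.query else "")
        if INCLUDE and not any(full.startswith(p) for p in INCLUDE):
            return False
        return not any(full.startswith(p) for p in EXCLUDE)

    print(is_allowed("http://www.site.com/reports/2009.html"))   # True
    print(is_allowed("http://www.site.com/forum.php?topic=12"))  # False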

The Save and Load buttons are used to save/load the include and exclude path lists.

Note that internal paths should always start with the "/" character. If a path is specified without this character, Site Analyzer assumes that a subdomain of the site's main domain is meant.

The Scan subdomains option makes the program check subdomains of the main domain; it will treat them as internal pages.
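
A minimal sketch of that classification (the main domain is a placeholder):

    from urllib.parse import urlparse

    MAIN_DOMAIN = "site.com"  # hypothetical main domain of the project

    def is_internal(url, scan_subdomains=True):
        host = urlparse(url).hostname or ""
        if host in (MAIN_DOMAIN, "www." + MAIN_DOMAIN):
            return True
        # With "Scan subdomains" enabled, blog.site.com etc. also count as internal.
        return scan_subdomains and host.endswith("." + MAIN_DOMAIN)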


Report

The information from a site scan is presented in three views: a summary table, a summary report and a detailed report.

A summary table. This table includes all the site's pages. The following parameters are displayed for each of them:

Google PageRank
Page size
Server response code
Last modification date (if provided by the server, otherwise the scan date is shown)
Status indicator. The indicator is shown as a circle with the following colors:
- Green – there are no errors or faults on the page
- Yellow – there are no errors on the page, but some tags are missing
- Red – there are non-valid links and/or images
- Orange – the page is a lost page

The table can be sorted on any of the columns and exported in various formats including HTML, Excel, and Word.

Summary report. This contains summary information about the site.

- Total number of pages
- Number of faulty pages (pages with non-working links or images)
- Number of pages with missing tags
- Number of pages without any faults
- Number of lost pages
- Total size of all pages
- Number of pages grouped by Google PageRank, total PageRank and average PageRank per page
- Information about external links from the site. This section can be displayed in several formats:
1. Extended mode. All external links are listed for each site page
2. Shortened mode. A list of all external links from the whole site is shown without specifying the pages
3. Showing link anchor. The text (anchor) of the link is shown next to each link address

The detailed report contains information about each page, including error details and a list of external links.

The Open in Browser button opens the report in your browser.

HTML sitemap

The module that creates the HTML site map consists of two main sections – the Editing area and the Preview area.

The editing area contains all the site pages detected by the scan. Each page can be included in or excluded from the sitemap, and can be removed entirely from the list of available pages with the Del key. Use the Ctrl, Right, Left, Up and Down keys to arrange the sitemap structure you require, then click the Refresh html button to refresh the map preview area.

The program can create two types of view – horizontal (one column) and vertical (several columns).

When you create a site map, you should select the following parameters:

Tree depth – if the depth is set to 1, the sitemap will include only first-level pages linked directly from the main page. If the depth is set to 2, it will also include second-level pages linked directly from first-level pages, and so on. If you set the depth to 0, the sitemap will include all pages without exception.
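
A small sketch of depth-limited rendering, assuming a simple nested-dictionary page tree (the structure and names are hypothetical):

    def render(node, depth_limit, level=0):
        # Render one page and, while level < depth_limit, its children
        # as nested <ul> lists. A depth_limit of 0 means "no limit".
        item = '<li><a href="%s">%s</a>' % (node["url"], node["title"])
        children = node.get("children", [])
        if children and (depth_limit == 0 or level < depth_limit):
            item += "<ul>" + "".join(render(c, depth_limit, level + 1)
                                     for c in children) + "</ul>"
        return item + "</li>"

    home = {"url": "/", "title": "Home",
            "children": [{"url": "/about.html", "title": "About"}]}
    print("<ul>" + render(home, 1) + "</ul>")  # depth 1: first-level pages only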

Number of columns (for vertical maps only) – a vertical sitemap is presented in columnar format; this setting controls the number of columns.

Scheme – the program can create maps in several color schemes to allow the site map appearance to match the design of your website.

While the program is working, it stores two copies of the sitemap: the first is the result obtained automatically after the site is scanned; the second is the current sitemap with your manual modifications. At any time you can switch between them using the Default sitemap and User sitemap buttons.

Add node allows you to add a new node (page, address) to the sitemap. You can add new nodes to horizontal or vertical maps.

The Expand and Collapse buttons expand and collapse all nested nodes of the displayed map.

The Refresh html button refreshes the displayed site map in the preview area.

Save allows the map to be saved to a specified location ready to be uploaded to your site.

Open in browser opens the current version of the sitemap in your web browser.


XML sitemap

In this section you can create an XML sitemap of your website. XML maps are not intended to be read by humans but can be submitted to search engines to allow them to crawl your website more effectively.
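
XML sitemaps follow the sitemaps.org protocol; a minimal map containing a single address looks like this (the URL and date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.site.com/</loc>
        <lastmod>2009-01-23</lastmod>
      </url>
    </urlset>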

Sitemap.
Create new sitemap (blank) – creates a blank XML sitemap as a starting point.
Create sitemap by the crawler data – creates a sitemap using data from the HTTP scanner, as obtained during a general site analysis.

Open sitemap from file – opens an XML sitemap from a specified XML file.
Add a page from the text file – adds new addresses to an existing map from a text file.

Save sitemap to file – saves the sitemap to a specified XML file.
Upload to FTP – uploads the created sitemap to the FTP directory of your site, after which the map becomes available to search engines. As well as uploading the sitemap, you can also inform Google that the sitemap has been refreshed.
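
At the time of writing, notifying Google can be done with a plain HTTP request to its sitemap "ping" address; a minimal sketch (the sitemap location is a placeholder):

    import urllib.request
    from urllib.parse import quote

    sitemap_url = "http://www.site.com/sitemap.xml"  # placeholder sitemap address
    ping = "http://www.google.com/ping?sitemap=" + quote(sitemap_url, safe="")
    urllib.request.urlopen(ping)  # asks Google to re-read the sitemap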

Edit items.
Edit selected items – allows page parameters to be set for several selected addresses at once.
Add new item – adds a new item to the XML sitemap.
Delete selected items – deletes addresses from the sitemap.

Note: Any address parameters can be edited in the displayed table. You can also press the Del key on any of the parameters to delete them from the map.

Crawler.
Update sitemap by crawler data – adds any newly detected pages to the sitemap and refreshes the Last modified parameter for existing pages.
Add new pages by crawler data – adds newly detected addresses to the map without changing data for existing addresses.
Update Last modified for selected items – refreshes the Last modified parameter for selected pages.

The date of the last scan is displayed under the buttons described above.

Options.
Allows changes to the default settings that are used for added addresses. There are several ways to specify dates:

- specify the current date for all pages,
- use a selected date,
- use the date of the most recent edit of the HTML file (if the server provides this),
- omit the parameter entirely.

If you attempt to delete a group of addresses from the map you will, by default, be asked to confirm the deletion. The Delete records with request option allows this confirmation prompt to be disabled.

Open in notepad.
This launches the Notepad application with the XML file opened. This may be useful if you need to edit the raw XML sitemap data.

Last scan date. The date of the most recent Site Analyzer scan is shown.


Robots.txt editor

In this section you can create a robots.txt file, which search engines consult when indexing the site.

Robots list.
Select which robots should be included in the file.

Select the required robot and click the Add button – the site's FTP directory listing will be retrieved and you will be able to select any pages that should not be indexed.

The Clear all button removes all indexing information from the file (in effect, it creates a blank robots.txt file), so no robots will be forbidden from indexing.

The Clear button deletes indexing information for the selected robot only, so that robot will no longer be restricted.

The Save button saves the result to a file.

The Upload button uploads the file to the server.

Selected robot.
This section displays the part of robots.txt that applies to the currently selected robot.

Robots.txt.
This section displays the entire resulting robots.txt file.

The Use sitemap option allows you to add a line to the robots.txt file that informs search engines about the presence of an XML sitemap.
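
For reference, a robots.txt that keeps one robot out of a forum section, leaves all other robots unrestricted, and advertises an XML sitemap might look like this (the paths and addresses are placeholders):

    User-agent: Googlebot
    Disallow: /forum/

    User-agent: *
    Disallow:

    Sitemap: http://www.site.com/sitemap.xml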
