Solutions

How do I configure Director to crawl a website, using the Content Sync Module?

Solutions ID:    KB4482
Version:    8.0
Status:    Published
Published date:    07/13/2011
Updated:    02/09/2012
 

Problem Description

I have already installed the Content Sync Module (CSM) of Director and would now like details on how to create a job. 

What is the Content Sync Module used for in Director deployments?

Can you crawl a website to pre-populate the cache on your SG appliances?

Can you pre-populate content on your SG appliances?

What type of content can be pre-populated using the CSM feature of Director?

Resolution

The Content Sync Module (CSM) of Director builds a list of HTTP or CIFS objects and folders for pre-population. After the initial scan, CSM can rescan for updates to the content. Once the content list is created, it can be uploaded to Director, either on a schedule or manually. Director then pushes the list to each ProxySG in the content distribution list.

CSM must be installed on a Windows or Linux workstation. Managing the Director appliance requires a Java applet, which can be downloaded from the Director appliance at login.

CSM operates by crawling a CIFS or HTTP server and tracking the time each object was last modified, so that the ProxySG appliances can be updated accordingly. It generates a list of files along with their last-modified times and keeps that list in a flat file rather than a true database. You can then push this list out to your SG network through Director. Each appliance receives the list, deletes the objects that should no longer be there, and downloads the other files.
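The rescan behavior described above can be illustrated with a minimal sketch. This is not CSM's code or file format: the manifest shape (a mapping from URL to last-modified timestamp), the function name `diff_manifests`, and the example URLs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical manifest shape: URL -> last-modified time (seconds since epoch).
# CSM's real flat-file format is not documented in this article.
Manifest = Dict[str, float]


def diff_manifests(previous: Manifest, current: Manifest) -> Tuple[List[str], List[str]]:
    """Compare two scans and return (to_fetch, to_delete).

    to_fetch:  URLs that are new or whose last-modified time changed.
    to_delete: URLs that were in the previous scan but are gone now.
    """
    to_fetch = sorted(
        url for url, mtime in current.items()
        if url not in previous or previous[url] != mtime
    )
    to_delete = sorted(url for url in previous if url not in current)
    return to_fetch, to_delete


previous_scan = {
    "http://www.example.com/index.html": 1328677000.0,
    "http://www.example.com/old.css": 1328677000.0,
}
current_scan = {
    "http://www.example.com/index.html": 1328679000.0,  # modified since last scan
    "http://www.example.com/new.js": 1328679000.0,      # newly discovered
}

to_fetch, to_delete = diff_manifests(previous_scan, current_scan)
print("fetch:", to_fetch)
print("delete:", to_delete)
```

In this sketch an object counts as changed whenever its timestamp differs from the previous scan, which is one simple way to decide what an appliance needs to download again.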

Note: The Content Sync Module does not ship with Director, but is available as a separately downloadable module. This article assumes you have a working Director appliance and are able to log in to it using the Director Management Console (DMC) Java application. For details on how to install the CSM, see KB4481.

To create a job that crawls a website for content, follow these steps:

  • Start CSM on your workstation, click File, and select New Job. You can give the job any name you want; we chose the default name.
  • Click the Schedule box, select the Recurring tab, and specify the Day of the Week and Time.

 

  • Click Add > and confirm that your schedule appears under Result Schedule. (Alternatively, you can select the One Time tab to schedule a single run.)
  • Under the new job you just created, select Scan, then select the Crawl URLs radio button.
  • The URL must be preceded by http://.
  • Select the remaining radio buttons according to your preferences.
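As a small illustration of the http:// requirement in the steps above, the following sketch checks a URL before you paste it into the Scan window. The helper name `is_crawlable_url` is hypothetical, and the check reflects only what this article states (an http:// prefix); it is not CSM's own validation.

```python
from urllib.parse import urlsplit


def is_crawlable_url(url: str) -> bool:
    """Return True if the URL starts with http:// and names a host.

    Assumption: only the http scheme is accepted, because that is the
    only prefix this article mentions.
    """
    parts = urlsplit(url.strip())
    return parts.scheme == "http" and bool(parts.netloc)


for candidate in ("http://www.example.com/", "www.example.com", "http://"):
    print(candidate, "->", is_crawlable_url(candidate))
```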

  • Select Blue Coat Director under Jobs and enter the Director IP address and administrator credentials.
  • You can specify the connection protocol as either Telnet or SSHv2. (The CSM software uses the Expect software you installed previously to connect to your Director.)

  • Select the Synchronize item under Jobs and click Enable Synchronization. Note: Synchronization allows the CSM software to update ProxySGs whose information has changed.
  • TIP: If you do not select this option, the CSM software will only download the content to your local host and will NOT contact the Director appliance.
  • Select "Synchronize all Devices and Groups". The default is to synchronize all devices and groups associated with Director.

 

  • NOTE: The output of the scan is placed in the C:/Program Files/Expect-5.21/bin/Output folder. This folder is created only when you run your first scan.
  • The scan may take up to 20 minutes to complete, depending on the website you choose.
  • After a successful scan, the output log will look something like this, except for the enable password errors.

 

Points to note:
The latest CSM log keeps track of the latest version of each object on the chosen website. In other words, CSM does not download, cache, and push actual content out to the SG. It only keeps track of what content there is to cache and hands that list off to each SG, which then downloads and caches the objects. The list is displayed in the tracking window when a job first starts to run, along with a progress report after every 500 objects scanned or crawled. The log is kept in the data directory where you installed CSM and is identified by a timestamp. This log is in non-verbose mode by default and is called CsmGuiJobs.txt. You can change the default to verbose mode by selecting Tools > Options > Verbose Mode in the main window.

The CSM configuration file contains all the settings for one job. Each new job has its own configuration file (for example, csm001.cfg, csm002.cfg, and so on), located in C:/Program Files/Expect-5.21/bin/data. The first CSM configuration file for the job you create is titled csm.cfg. Each time the job is run, the csmXXX.cfg file is written to the data directory with a timestamp, so you can see what changes you made for each run of the job.

The CSM configuration file, called CSM001.cfg, is kept in the same folder and should not be edited directly. Most of the settings can be changed through the standard windows of the Management Console; a few can be changed only through the Advanced window of the Management Console. (These few settings generally do not need to be changed; the defaults are usually satisfactory.)

Compatibility:

The recommended platform for the Content Sync Module and Expect is Microsoft Windows XP with Service Pack 3 installed. There are known problems when this software is installed on Windows 7 and on 64-bit Windows 2003 servers.

 

Frequently asked questions:

1: When we create a CIFS crawl job, what is the correct entry for the "Corresponding URL" box? If you leave this option blank the job does not run, so what must I place here?

The Corresponding URL is used only when you are scanning directories. Because the Director appliance and SG network can distribute only URLs, CSM must send out URLs. Each URL uses this syntax: "file://<SG IP address>

Here is sample output from a CSM job pulling files from the default 'Sample Pictures' directory on a Windows workstation.

Using username "admin".
Last login: Wed Feb  8 05:02:50 2012 from 10.125.48.32 
  Copyright (c) 1997-2010, BlueCoat Systems, Inc.
 
  Welcome to SG-ME 5.5.1.2 #65441 2011.05.03-034023 
 
Director > 
Director> enable 
DIrector # cli help disable
DIrector # line-vty length 0
DIrector # content distribute url "
file://10.125.0.51/Sample%20Pictures/Winter.jpg" all
Command ID: 1328677458899394 
 
DIrector # content distribute url "
file://10.125.0.51/Sample%20Pictures/Water%20lilies.jpg" all
Command ID: 1328677459240196 
 
DIrector # content distribute url "
file://10.125.0.51/Sample%20Pictures/desktop.ini" all
Command ID: 1328677459538524 
 
DIrector # content distribute url "
file://10.125.0.51/Desktop.ini" all
Command ID: 1328677459845087 
 
DIrector # content distribute url "
file://10.125.0.51/Sample%20Pictures/Sunset.jpg" all
Command ID: 1328677460135670 
 
DIrector # content distribute url "
file://10.125.0.51/Sample%20Pictures/Thumbs.db" all
Command ID: 1328677460342931 
 
DIrector # content distribute url "
file://10.125.0.51/Sample%20Pictures/Blue%20hills.jpg" all
Command ID: 1328677460550356 
 
DIrector # exit

Blue Coat Systems CSM/SG-ME 5.3.0.1 #32468 2008.01.30-083843 ended: Wed Feb 08 10:23:26 India Standard Time 2012

2: Why do we see URLs like this? "file://10.125.0.51/<file:///\\10.125.0.51\>"

This is caused by the Outlook HTML format, which automatically identifies URLs and converts them to hyperlinks. When the message is viewed in plain-text mode, both the text and the link are displayed, as shown above.
For more clarity, here is an example:

Suppose a directory (C:\MyDir\) has the following contents:
C:\MyDir\test.jpg
C:\MyDir\abc.gif
C:\MyDir\Test1\image1.jpg

If you scan that directory (C:\MyDir\) using CSM and provide the Corresponding URL “file://testserver/”, CSM will generate and distribute the following URLs:
file://testserver/test.jpg
file://testserver/abc.gif
file://testserver/Test1/image1.jpg

That means the path of whatever directory you are scanning is replaced by the Corresponding URL.
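The mapping in this example can be sketched as follows. This is an illustration only, not CSM's implementation: the function name `to_distribution_url` is hypothetical, and percent-encoding of spaces is an assumption based on the `Sample%20Pictures` URLs in the sample session earlier in this article.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def to_distribution_url(scan_root: str, file_path: str, corresponding_url: str) -> str:
    """Replace the scanned directory prefix with the Corresponding URL."""
    # Path of the file relative to the scanned directory, e.g. Test1\image1.jpg.
    relative = PureWindowsPath(file_path).relative_to(PureWindowsPath(scan_root))
    # Percent-encode each path segment (assumption: spaces become %20).
    encoded = "/".join(quote(part) for part in relative.parts)
    return corresponding_url.rstrip("/") + "/" + encoded


scan_root = r"C:\MyDir"
for path in (r"C:\MyDir\test.jpg", r"C:\MyDir\abc.gif", r"C:\MyDir\Test1\image1.jpg"):
    print(to_distribution_url(scan_root, path, "file://testserver/"))
```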

3: Does the Content Sync Module ( CSM) application create jobs on the Director appliance? 

No, the CSM does not create a job on the Director appliance. A job runs only when triggered by the CSM application. Each time it runs, it uses Director CLI commands to execute the tasks on the Director appliance.

You can also use the Query option provided in the CSM to check the caching status of the URLs that you have distributed. Here is an example screenshot:
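To make the idea of a CLI-driven run concrete, here is a minimal sketch that builds the kind of command sequence seen in the sample session earlier in this article (`enable`, one `content distribute url "<url>" all` per object, then `exit`). It only builds strings and does not open a Telnet or SSH connection. The function name `build_distribute_commands` is hypothetical, and the commands CSM actually sends may differ.

```python
from typing import Iterable, List


def build_distribute_commands(urls: Iterable[str]) -> List[str]:
    """Build a Director CLI command list modeled on the sample session."""
    commands = ["enable"]
    for url in urls:
        # The URL is wrapped in double quotes, so it must not contain one.
        if '"' in url:
            raise ValueError(f"URL contains a double quote: {url!r}")
        # "all" is the target used in the sample session in this article.
        commands.append(f'content distribute url "{url}" all')
    commands.append("exit")
    return commands


commands = build_distribute_commands([
    "file://10.125.0.51/Sample%20Pictures/Winter.jpg",
    "file://10.125.0.51/Sample%20Pictures/Sunset.jpg",
])
for line in commands:
    print(line)
```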

 

Other articles:

For details on how to create a job to scan a CIFS server, see KB4515

For a definition of what it means to crawl a webserver, see WIKI site.

For details on a known problem with CSM and timezone changes, see KB4483

For a list of ProxySG versions compatible with Director SGME 5.5.1.2, see KB1568

For details on problems you may face when launching the Director Management Console Java application, see KB4383

For details on helpful Director command Line syntax, see KB4178

