Capt. Horatio T.P. Webb
Parks -- SPRING 2009
The most common activity on the web is searching the web using one of the large variety of search engines. Companies like Google that do this for a living spend a tremendous amount of effort wandering around the web, finding pages and storing the results in an easily accessible fashion. This process is generally a trade secret of the firms that perform this service. However, it is NOT rocket science.
We can employ ASP to do this task in a VERY limited way using server-side AJAX. Suppose we wish to use the MIS 4372 syllabus as a starting point and develop a spider that crawls the homepage, looks around, and then searches all the pages that appear as links on that page. It is just a matter of storing the results of the searches and their URLs.
The code uses MSXML2.ServerXMLHTTP, the server-side counterpart of the XMLHTTP object that IE uses for AJAX, to read the page's content:
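set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")   ' server-side HTTP request object
xmlhttp.open "GET", full_url, false                  ' false = synchronous: wait for the response
xmlhttp.send ""                                      ' a GET sends no request body
page_content=xmlhttp.responseText                    ' the whole page comes back as one string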
It sends no data to the page -- it just retrieves the page and stores it in the variable "page_content". Now the entire page is just a string. The code (as shown below) will search the course syllabus page at http://www.bauer.uh.edu/parks/disc4372.htm (the string "page_content") and retrieve all the links. It does this by:
Once the list of local URLs has been made, it uses the same AJAX call to retrieve each linked page, stores the page content as a string, and then tries to find the desired search term (in the code this is stored in a variable named "search_for").
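The link-gathering step might be sketched as follows. This is only a minimal illustration, not the code behind this page: the function name ExtractLocalLinks and the parameter base_dir are made up for the example, and it assumes that "local" means a relative link or a link that starts with the folder of the starting page. It ignores root-relative links (those starting with "/") and does not tidy up "../" paths.

```vbscript
' Hypothetical helper (not from the original page): collect the href targets
' found in page_content and keep only those that live under base_dir.
Function ExtractLocalLinks(page_content, base_dir)
    Dim re, matches, m, href, found
    Set found = CreateObject("Scripting.Dictionary")   ' keys = unique URLs
    Set re = New RegExp
    ' href = "..." or href = '...' ; stop at the closing quote or a # fragment
    re.Pattern = "href\s*=\s*[""']([^""'#]+)"
    re.IgnoreCase = True
    re.Global = True
    Set matches = re.Execute(page_content)
    For Each m In matches
        href = Trim(m.SubMatches(0))
        ' Assumption: no ":" and no leading "/" means a relative link,
        ' so glue it onto the folder of the starting page.
        If InStr(href, ":") = 0 And Left(href, 1) <> "/" Then
            href = base_dir & href
        End If
        ' Keep only URLs under base_dir (drops mailto:, other sites, etc.)
        If LCase(Left(href, Len(base_dir))) = LCase(base_dir) Then
            If Not found.Exists(href) Then found.Add href, True
        End If
    Next
    ExtractLocalLinks = found.Keys                      ' zero-based array of URLs
End Function
```

For the syllabus used here, a call could look like: links = ExtractLocalLinks(page_content, "http://www.bauer.uh.edu/parks/"). A Dictionary is used so a page that is linked twice is only fetched once.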
Execute this search here (i.e., find "Snidely Whiplash" by crawling my MIS 4372 syllabus).
Because this page is on the syllabus, this page is searched (along with lotsa others). Because this page contains the value of the "search_for" variable in the code shown above (search_for="Snidely Whiplash"), this page is the only one identified as a successful search.
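The fetch-and-search step described above might look like the sketch below. Again this is an assumption-laden illustration rather than the actual code: FetchPage and FindMatches are hypothetical names, the match is made case-insensitive (the real code may not do this), and any page that fails to load or does not return status 200 is simply treated as empty.

```vbscript
' Hypothetical helper: fetch one page and return its text ("" on any failure).
Function FetchPage(full_url)
    Dim xmlhttp
    FetchPage = ""
    Set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
    On Error Resume Next                  ' a dead link should not stop the crawl
    xmlhttp.open "GET", full_url, False
    xmlhttp.send ""
    If Err.Number = 0 Then
        If xmlhttp.status = 200 Then FetchPage = xmlhttp.responseText
    End If
    On Error GoTo 0
End Function

' Hypothetical helper: return the URLs in the array "urls" whose content
' contains search_for.
Function FindMatches(urls, search_for)
    Dim hits, i, page
    Set hits = CreateObject("Scripting.Dictionary")
    For i = 0 To UBound(urls)
        page = FetchPage(urls(i))
        ' Assumption: vbTextCompare, so "snidely whiplash" also matches
        If InStr(1, page, search_for, vbTextCompare) > 0 Then
            hits(urls(i)) = True
        End If
    Next
    FindMatches = hits.Keys               ' zero-based array of matching URLs
End Function
```

A call such as matches = FindMatches(links, "Snidely Whiplash") would then return the list of pages to report, where links is whatever array of URLs the crawl has collected.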
You might assume that you could write this code and execute it in a browser. You CANNOT do this. Browser makers enforce a same-origin policy: script running in a page may not read the content of pages from other sites, which shuts off many "cross-site scripting" abuses. As this example demonstrates, it is possible to do some things inside your own sandbox. Further, without this server-side capability, search engines would be unable to automate the crawling that lets them search and index the web.
Be careful what you ask for...