py_crawler
**********
A web spider written in Python
To install the dependencies:
>> pip install -r requirements.txt
To run the test cases:
>> python -m unittest discover -v
To run the main / driver program:
>> python main.py
Modules
*******
-crawler
Contains the core Crawler module
-retreiver
Contains Retriever, the base class of both HTMLPageRetriever & XMLRetriever, which collect pages from the web as instructed by the Crawler (a sketch of this hierarchy follows the module list)
-utils
Includes some helper functions
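The relationship between the retriever classes might look roughly like the sketch below. This is an illustration only, assuming an extract_links hook on the base class; the actual method names and signatures in the repository may differ.

    # Illustrative sketch of the retriever hierarchy; method names are
    # assumptions, not taken from the repository.
    class Retriever:
        def __init__(self, url, timeout=3):
            self.url = url
            self.timeout = timeout

        def extract_links(self, content):
            # Subclasses implement format-specific link extraction.
            raise NotImplementedError

    class HTMLPageRetriever(Retriever):
        def extract_links(self, content):
            # Parse text/html content and return the URLs found in it.
            ...

    class XMLRetriever(Retriever):
        def extract_links(self, content):
            # Parse text/xml content (e.g. sitemaps or feeds) and return the URLs found.
            ...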
Approach
********
There are two ways to run this crawler.
- Specify how many pages to crawl
cr1 = Crawler('http://www.windowscentral.com', url_count=20, strict=True, timeout=5, multi=10)
- Crawl until interrupted
cr2 = Crawler('http://www.windowscentral.com/', strict=True, timeout=5, multi=10)
To interrupt, press Ctrl+C. A minimal usage sketch is shown below.
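A minimal driver sketch, assuming the Crawler class lives in the crawler module and exposes a crawl() method; both names are assumptions and may differ from the actual API used in main.py.

    # Hedged usage sketch; the crawl() method name is assumed, not confirmed.
    from crawler import Crawler

    cr = Crawler('http://www.windowscentral.com', url_count=20, strict=True,
                 timeout=5, multi=10)
    try:
        cr.crawl()        # runs until url_count pages are crawled (or the queue empties)
    except KeyboardInterrupt:
        pass              # Ctrl+C stops the open-ended variant; partial results are kept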
- Once the crawler is engaged, the initial URL is placed in a queue and the crawling process begins.
- This process runs until the number of pages crawled matches the requested count (case 1) or the processing queue is empty (case 2).
- When a URL taken from the queue is visited, the extraction logic is delegated to HTMLPageRetriever if the MIME type is text/html, or to XMLRetriever if it is text/xml, each running in a separate thread.
- Once a link is crawled, its URL (without the protocol and any trailing '/') is stored and marked as visited, so it is not processed again when the same link appears on other pages.
- For every child URL found in a page, the following processing is done (a sketch of this normalization appears after this list):
    - same-domain links that start with '/', './', '//' or '#' are converted into absolute URLs
    - URL parameters are ignored
    - duplicates are removed
    - URLs with other schemes, such as mailto:, are ignored
- The mapping of pages to their child URLs, referred to as the directory, is maintained throughout the crawling process and is written out as a JSON file.
- The timeout interval can be configured to give the retriever enough time; the default is 3 seconds.
- Multi mode: the number of threads spawned by the crawler can be configured by the user; the default is 5 threads.
- Strict mode: restricts the crawler to the same domain as the input URL.
- If a timeout, HTTP error, or similar failure occurs while visiting a page, that page is ignored and the process continues with the next link in the queue.
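The child-URL handling described above might look roughly like the following sketch. It uses Python's standard urllib.parse; normalize_links is a hypothetical helper, and the exact rules are illustrative assumptions rather than the repository's actual implementation.

    # Illustrative sketch of the child-URL normalization rules described above;
    # normalize_links is a hypothetical helper, not the repository's actual code.
    from urllib.parse import urljoin, urlparse, urlunparse

    def normalize_links(page_url, hrefs, strict=True):
        base_domain = urlparse(page_url).netloc
        seen, result = set(), []
        for href in hrefs:
            if not href or href.startswith(('mailto:', 'javascript:', 'tel:')):
                continue                          # ignore non-http(s) schemes
            absolute = urljoin(page_url, href)    # resolves '/', './', '//' and '#' forms
            parts = urlparse(absolute)
            if parts.scheme not in ('http', 'https'):
                continue
            if strict and parts.netloc != base_domain:
                continue                          # strict mode: stay on the same domain
            # Drop URL parameters/query/fragment and the trailing '/'.
            cleaned = urlunparse((parts.scheme, parts.netloc,
                                  parts.path.rstrip('/'), '', '', ''))
            key = cleaned.split('://', 1)[1]      # visited key: URL without the protocol
            if key not in seen:
                seen.add(key)
                result.append(cleaned)
        return result

    # Example:
    # normalize_links('http://www.windowscentral.com/news',
    #                 ['/apps', './apps', '#top', 'mailto:a@b.com'])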