WebScraper Class

An easy-to-use base class which developers can extend to rapidly build custom web-scraping applications.
Inheritance Hierarchy
System.Object
  IronWebScraper.WebScraper

Namespace:  IronWebScraper
Assembly:  IronWebScraper (in IronWebScraper.dll) Version: 4.0.4.25470 (4.0.4.3)
Syntax
C#
public abstract class WebScraper

VB
Public MustInherit Class WebScraper
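
A minimal subclass, closely following the official tutorial style, might look like the sketch below; the start URL and CSS selector are illustrative placeholders, not part of the API.

using IronWebScraper;

class BlogScraper : WebScraper
{
    public override void Init()
    {
        LoggingLevel = WebScraper.LogLevel.All;        // verbose logging while developing
        Request("https://example.com/blog", Parse);    // hypothetical start URL
    }

    public override void Parse(Response response)
    {
        foreach (var titleLink in response.Css("h2.entry-title a"))   // illustrative selector
        {
            Scrape(new ScrapedData() { { "Title", titleLink.TextContentClean } });
        }
    }
}

The scraper is then run with new BlogScraper().Start(); (or StartAsync).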

The WebScraper type exposes the following members.

Properties
FailedUrls
Gets the number of failed HTTP requests which have exceeded their maximum number of retries.
HttpRetryAttempts
The number of times WebScraper will retry a failed URL (normally with a new identity) before considering it non-scrapable.
HttpTimeOut
Gets or sets the time after which an HTTP request will be considered failed or lost (host non-contactable or DNS unavailable).
MaxHttpConnectionLimit
Gets or sets the total number of allowed open HTTP requests (threads).
OpenConnectionLimitPerHost
Gets or sets the allowed number of concurrent HTTP requests (threads) per hostname or IP address. This helps protect hosts against too many requests.
RateLimitPerHost
Gets or sets the minimum polite delay (pause) between requests to a given domain or IP address.
SuccessfulFileDownloadCount
Gets the number of successful HTTP downloads made using the DownloadFile and DownloadImage methods.
SuccessfulfulRequestCount
Gets the number of successful HTTP requests.
ThrottleMode
Makes the WebScraper intelligently throttle requests not only by hostname, but also by the host server's IP address. This is polite in case multiple scraped domains are hosted on the same machine.
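
These connection and rate-limit properties are typically tuned in Init. A hedged sketch: the values are illustrative, and the Throttle enum value follows the official tutorials rather than anything shown on this page.

public override void Init()
{
    Request("https://example.com/", Parse);                 // hypothetical start URL
    MaxHttpConnectionLimit = 80;                            // total open HTTP requests
    OpenConnectionLimitPerHost = 4;                         // concurrent requests per host
    RateLimitPerHost = TimeSpan.FromMilliseconds(500);      // polite delay per host
    ThrottleMode = Throttle.ByDomainHostNameAndIpAddress;   // assumed enum value
}
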
Methods
AcceptUrl
Decides if the WebScraper will accept a given URL. May be overridden to apply custom middleware logic.
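
This page does not show AcceptUrl's exact signature; the sketch below assumes it receives the candidate URL as a string and returns whether to crawl it.

public override bool AcceptUrl(string url)   // signature assumed
{
    if (url.Contains("/calendar/"))          // illustrative rule: skip infinite calendar pages
        return false;
    return base.AcceptUrl(url);
}
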
ChooseIdentityForRequest
Picks a random identity from WebScraper.Identities for each request. Create identities (each with its own proxy IP address, user agent, headers, cookies, username and password) in your Init method and add them to the WebScraper.Identities list.

Override this method to implement your own, non-random logic for selecting an HttpIdentity for each request.

DownloadFile(String, String, Boolean, HttpIdentity)
Requests a file to be downloaded from the given URL to the local file-system. Often used for scraping documents, assets and images. Normally called within a Parse method of IronWebScraper.WebScraper.

DownloadFile(Uri, String, Boolean, HttpIdentity)
Requests a file to be downloaded from the given URL to the local file-system. Often used for scraping documents, assets and images. Normally called within a Parse method of IronWebScraper.WebScraper.

DownloadFileUnique
Much like DownloadFile, except that a file which has already been downloaded or exists locally will not be re-downloaded. Normally called within a Parse method of IronWebScraper.WebScraper.

DownloadImage(String, String, Int32, Int32, Boolean, HttpIdentity)
Requests an image to be downloaded from the given URL to the local file-system. Normally called within a Parse method of IronWebScraper.WebScraper.

DownloadImage(Uri, String, Int32, Int32, Boolean, HttpIdentity)
Requests an image to be downloaded from the given URL to the local file-system. Normally called within a Parse method of IronWebScraper.WebScraper.
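
A sketch of calling the download helpers from a Parse method. The selectors and target folders are illustrative, and it is assumed that the trailing Boolean and HttpIdentity parameters are optional.

public override void Parse(Response response)
{
    foreach (var link in response.Css("a.report"))     // hypothetical selector
    {
        // Assumed parameter meanings: source URL, local target directory.
        DownloadFile(link.Attributes["href"], WorkingDirectory + "/files");
    }
    foreach (var img in response.Css("img.product"))   // hypothetical selector
    {
        DownloadImage(img.Attributes["src"], WorkingDirectory + "/images");
    }
}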

EnableWebCache
Caches web HTTP responses for reuse. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped URLs.
EnableWebCache(TimeSpan)
Caches web HTTP responses for reuse for the duration of the given TimeSpan. This allows WebScraper classes to be modified and restarted without re-downloading previously scraped URLs.
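
Typically called in Init; the start URL below is a placeholder.

public override void Init()
{
    // Cache responses for a day so the scraper can be edited and restarted
    // without re-fetching pages that were already downloaded.
    EnableWebCache(TimeSpan.FromDays(1));
    Request("https://example.com/", Parse);   // hypothetical start URL
}
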
Equals (inherited from Object)
FetchUrlContents (static)
A handy shortcut method that fetches the text content of any URL (synchronously).
FetchUrlContentsBinary
A handy shortcut method that fetches the content of any URL (synchronously) as binary data in a byte array (byte[]).
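
The static shortcut can be used without writing a full scraper class; the URL is a placeholder, and any optional parameters are not shown on this page.

// One-off, synchronous fetch of a page's text content.
string html = WebScraper.FetchUrlContents("https://example.com/");   // hypothetical URL
Console.WriteLine(html.Length);
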
GetHashCode (inherited from Object)
GetType (inherited from Object)
Init
Override this method to initialize your web scraper. Important tasks are to Request at least one start URL and to set allowed/banned domain or URL patterns.
Log
Logs the specified message to the console. Logging can be enabled using EnableLogging. This method is exposed and overridable to allow easy email and Slack notification integration.
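
Log's parameter list is not shown on this page; the sketch below assumes a single string message.

public override void Log(string message)   // signature assumed
{
    base.Log(message);                     // keep the normal console output
    // Forward the message to email/Slack here via your own notifier, e.g.:
    // MyNotifier.Send(message);           // hypothetical helper
}
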
ObeyRobotsDotTxtForHost
Causes the WebScraper to always obey /robots.txt directives, including path restrictions and crawl rates, on a domain-by-domain basis. May be overridden for advanced control.
Parse
Override this method to create the default Response handler for your web scraper. If you have multiple page types, you can add additional, similar methods.
ParseWebscraperDownload
Internal method to parse binary files downloaded by a WebScraper.
ParseWebscraperDownloadImage
Internal method to parse images downloaded by a WebScraper.
PostRequest(String, Action<Response>, Dictionary<String, String>)
Adds a new request to the scrape-job queue using the HTTP POST method.
PostRequest(Uri, Action<Response>, Dictionary<String, String>)
Adds a new request to the scrape-job queue using the HTTP POST method.
PostRequest(String, Action<Response>, Dictionary<String, String>, MetaData)
Adds a new request to the scrape-job queue using the HTTP POST method.
PostRequest(Uri, Action<Response>, Dictionary<String, String>, MetaData)
Adds a new request to the scrape-job queue using the HTTP POST method.
PostRequest(String, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)
Adds a new request to the scrape-job queue using the HTTP POST method.
PostRequest(Uri, Action<Response>, Dictionary<String, String>, HttpIdentity, MetaData)
Adds a new request to the scrape-job queue using the HTTP POST method.
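
A sketch of submitting a login form via POST; the URL, form-field names and the ParseLoggedIn handler are all illustrative (requires using System.Collections.Generic;).

public override void Init()
{
    var formData = new Dictionary<string, string>
    {
        { "username", "user" },     // illustrative form fields
        { "password", "secret" }
    };
    PostRequest("https://example.com/login", ParseLoggedIn, formData);
}

public void ParseLoggedIn(Response response)
{
    // Continue the crawl from the authenticated session.
    Request("https://example.com/account", Parse);   // hypothetical URL
}
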
Request(IEnumerable<String>, Action<Response>)
A key method, called from within the Init and Parse methods. Request adds new requests to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
Request(String, Action<Response>)
A key method, called from within the Init and Parse methods. Request adds a new request to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
Request(Uri, Action<Response>)
A key method, called from within the Init and Parse methods. Request adds a new request to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
Request(String, Action<Response>, MetaData)
A key method, called from within the Init and Parse methods. Request adds a new request to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
Request(Uri, Action<Response>, MetaData)
A key method, called from within the Init and Parse methods. Request adds a new request to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
Request(String, Action<Response>, HttpIdentity, MetaData)
A key method, called from within the Init and Parse methods. Request adds a new request to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
Request(Uri, Action<Response>, HttpIdentity, MetaData)
A key method, called from within the Init and Parse methods. Request adds a new request to the scrape-job queue and decides which method (e.g. Parse) will be used to parse the Response object.
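
Each page type can be given its own handler by passing a different method to Request; the selector and the ParseProduct method below are illustrative.

public override void Parse(Response response)
{
    // Queue every product link, handling those pages with ParseProduct.
    foreach (var link in response.Css("a.product"))   // hypothetical selector
    {
        Request(link.Attributes["href"], ParseProduct);
    }
}

public void ParseProduct(Response response)
{
    foreach (var heading in response.Css("h1"))
    {
        Scrape(new ScrapedData() { { "Name", heading.TextContentClean } });
    }
}
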
Retry
Retries a Response.

Usually called in a Parse method, this is useful if a CAPTCHA or error screen was encountered during HTML parsing.
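
A sketch of re-queuing a page that served a CAPTCHA. The detection check is illustrative, Css is assumed to return an array of matched nodes, and Retry is assumed to accept the current Response.

public override void Parse(Response response)
{
    if (response.Css("#captcha").Length > 0)   // crude, illustrative detection
    {
        Retry(response);                       // re-queue, typically with a fresh identity
        return;
    }
    // ... normal parsing continues here ...
}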

Scrape
Appends scraped data to a file in the JsonLines format (one JSON object per line). Will save a .NET object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer-defined classes for scraped data items. The default filename follows the pattern "NameSpace.TypeName.jsonl", e.g. IronWebScraper.ScrapedData.jsonl.
ScrapeUnique
Appends scraped data to a file in the JsonLines format (one JSON object per line), automatically ignoring duplicates. Will save a .NET object of any kind. This method is typically used with IronWebScraper.ScrapedData or developer-defined classes for scraped data items. The default filename follows the pattern "WorkingDirectory/NameSpace.TypeName.jsonl", e.g. Scrape/IronWebScraper.ScrapedData.jsonl.
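
ScrapedData behaves as a key/value container in the official samples; the selector below is illustrative.

public void ParseArticle(Response response)
{
    foreach (var title in response.Css("h2.entry-title"))   // hypothetical selector
    {
        // Appends one JSON line per item; ScrapeUnique also skips duplicates.
        ScrapeUnique(new ScrapedData() { { "Title", title.TextContentClean } });
    }
}
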
SetSiteSpecificCrawlRateLimit
Sets a throttle limit for a specific domain.
Start
Starts the WebScraper.

Set CrawlId to make this crawl resumable. Start will also resume a previous crawl with the same CrawlId if one exists.

Giving a CrawlId also causes the WebScraper to auto-save its state every 5 minutes in case of a crash, system failure or power outage. This feature is particularly useful for long-running web-scraping tasks, allowing hours, days or even weeks of work to be recovered effortlessly.

StartAsync
Starts the WebScraper asynchronously. Set CrawlId to make this crawl resumable. Will resume a previous crawl with the same CrawlId if one exists.
Stop
Stops this WebScraper instance gracefully. The WebScraper may be restarted later with no loss of data by calling Start(CrawlId) or StartAsync(CrawlId).
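
Start(CrawlId) and StartAsync(CrawlId), mentioned above, make a job resumable; the id string is illustrative.

var scraper = new BlogScraper();     // the subclass sketched earlier
scraper.Start("blog-crawl-01");      // a CrawlId makes the crawl resumable
// ... later, after Stop(), a crash or a reboot, the same id resumes the work:
// scraper.Start("blog-crawl-01");
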
ToString (inherited from Object)
UnScrape(Boolean)
Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.
UnScrape(String, Boolean)
Retrieves IronWebScraper.ScrapedData objects which were saved using the WebScraper.Scrape method.
UnScrape<T>(Boolean)
Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.
UnScrape<T>(String, Boolean)
Retrieves native C# objects which were saved using the WebScraper.Scrape method in the JsonLines format.
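
A hedged read-back sketch: the Product class is illustrative, the Boolean argument is passed as false only as a placeholder (its meaning is not documented on this page), and the return value is assumed to be enumerable.

class Product { public string Name { get; set; } }

var scraper = new BlogScraper();
foreach (Product p in scraper.UnScrape<Product>(false))   // Boolean meaning assumed
{
    Console.WriteLine(p.Name);
}
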
Fields
AllowedDomains
If not empty, every requested URL's hostname must match at least one of the AllowedDomains patterns. Patterns may be added as glob wildcard strings or Regex.
AllowedUrls
If not empty, every requested URL must match at least one of the AllowedUrls patterns. Patterns may be added as glob wildcard strings or Regex.
BannedDomains
If not empty, no requested URL's hostname may match any of the BannedDomains patterns. Patterns may be added as glob wildcard strings or Regex.
BannedUrls
If not empty, no requested URL may match any of the BannedUrls patterns. Patterns may be added as glob wildcard strings or Regex.
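
These filters are usually populated in Init; the patterns are illustrative, and the fields are assumed to expose a standard Add method.

public override void Init()
{
    Request("https://example.com/", Parse);   // hypothetical start URL
    AllowedDomains.Add("*.example.com");      // glob pattern (Regex also accepted)
    BannedUrls.Add("*/logout*");              // never follow logout links
}
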
CrawlId
A unique string used to identify a crawl job.
FilesDownloaded
The total number of files downloaded successfully with the DownloadImage and DownloadFile methods.
Identities
A list of HTTP identities to be used to fetch web resources.

Each identity may have a different proxy IP address, user agent, HTTP headers, persistent cookies, username and password.

Best practice is to create identities in your WebScraper.Init method and add them to this WebScraper.Identities list.
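
A sketch of registering identities in Init, as recommended above. The proxy address and user-agent string are placeholders; the HttpIdentity property names follow the official tutorials.

public override void Init()
{
    Request("https://example.com/", Parse);   // hypothetical start URL
    Identities.Add(new HttpIdentity()
    {
        UserAgent = "Mozilla/5.0 (compatible; MyScraper/1.0)",   // placeholder
        Proxy = "http://127.0.0.1:8080"                          // placeholder proxy
    });
    // ChooseIdentityForRequest will now pick randomly from this list.
}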

LoggingLevel
The level of logging made by the WebScraper engine to the console.

LogLevel.Critical is normally the most useful setting, allowing the developer to write their own meaningful, application-relevant messages inside Parse methods.

LogLevel.ScrapedData is useful when coding and testing a new WebScraper.

ObeyRobotsDotTxt
Causes the WebScraper to always obey /robots.txt directives, including URL and path restrictions and crawl rates.
WorkingDirectory
Path to a local directory where scraped data and state information will be saved.