Web scraping, or web data extraction, is a software technique for extracting information from web pages. There are different types of web scraping; the most common is the web crawling done by search engines, in which the text content of a website is extracted from its pages and indexed for searching. This article explains a simple method of extracting text content using C# and regular expressions.
The System.Net.WebClient class is used in .NET to access data from external web sources. It can download HTML content, or files of any other MIME type, much as a browser would, and it supports authentication when the server requires it. You can also add custom HTTP headers to your requests, for example a custom User-Agent that identifies your scraping application.
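As a minimal sketch of the configuration described above, the helper below creates a WebClient with a custom User-Agent header and optional credentials. The class name ScraperClientFactory, the method Create, and its parameters are hypothetical names chosen for illustration; only WebClient, its Headers collection, and NetworkCredential come from .NET itself.

```csharp
using System;
using System.Net;

public static class ScraperClientFactory
{
    // Hypothetical helper (not part of .NET): builds a WebClient that sends
    // a custom User-Agent and, optionally, credentials for servers that
    // require authentication.
    public static WebClient Create(string userAgent, string userName = null, string password = null)
    {
        WebClient wc = new WebClient();

        // "User-Agent" is the standard HTTP request header name
        wc.Headers.Add("User-Agent", userAgent);

        // Only attach credentials when the caller supplies a user name
        if (userName != null)
        {
            wc.Credentials = new NetworkCredential(userName, password);
        }
        return wc;
    }
}
```

A caller would typically wrap the returned client in a using block, call DownloadString on it, and let the block dispose of the client afterwards.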
Depending on the purpose of your application, you can develop a web application, a desktop utility, or a server program. This example shows a C# method that extracts the text content from a given URL. You can use this function in any type of application, but make sure the application has the necessary security permissions to access an external internet resource.
// Required namespaces (place these at the top of the file)
using System;
using System.Net;
using System.Text.RegularExpressions;

public string scrapeWebsite(string url)
{
    string extractedContent = "";
    // WebClient is IDisposable, so release it when finished
    using (WebClient wc = new WebClient())
    {
        // Identify the scraper through the standard User-Agent request header
        wc.Headers.Add("User-Agent", "Web-Scraper-Agent (your-custom-user-agent-here)");
        try
        {
            // Download the web page content from the URL
            extractedContent = wc.DownloadString(url);
            // Remove CSS style blocks, if any are found
            extractedContent = Regex.Replace(extractedContent, @"<style(.|\n)*?</style>", "", RegexOptions.IgnoreCase);
            // Remove script blocks
            extractedContent = Regex.Replace(extractedContent, @"<script(.|\n)*?</script>", "", RegexOptions.IgnoreCase);
            // Remove all images
            extractedContent = Regex.Replace(extractedContent, @"<img(.|\n)*?>", "", RegexOptions.IgnoreCase);
            // Remove all remaining HTML tags, leaving only the text inside
            extractedContent = Regex.Replace(extractedContent, @"<(.|\n)*?>", "");
            // Remove tabs, then collapse repeated spaces and line breaks (CRLF or LF)
            extractedContent = Regex.Replace(extractedContent, @"\x09", "");
            extractedContent = Regex.Replace(extractedContent, @"\x20{2,}", " ");
            extractedContent = Regex.Replace(extractedContent, @"(\x0D?\x0A)+", " ");
        }
        catch (Exception e)
        {
            extractedContent = "Error on downloading: " + e.Message;
        }
    }
    return extractedContent;
}
Higher-level applications of web scraping extract specific information from web pages, such as headlines and paragraphs, images, product details and prices, or exchange rates, rather than taking the content as a whole. These applications require more sophisticated parsing, and usually call for code written specifically for each content provider.
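To illustrate what such targeted extraction might look like, here is a minimal sketch that pulls only the headlines out of an HTML string. It assumes that headlines are marked up as h1 and h2 elements, which will not hold for every site; the class name HeadlineExtractor and the method ExtractHeadlines are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeadlineExtractor
{
    // Hypothetical helper: returns the inner text of every <h1> and <h2>
    // element found in the given HTML string.
    public static List<string> ExtractHeadlines(string html)
    {
        List<string> headlines = new List<string>();

        // \1 forces the closing tag to match the opening one (h1 with h1, h2 with h2).
        // Singleline lets "." match line breaks inside a heading.
        MatchCollection matches = Regex.Matches(
            html,
            @"<(h[12])\b[^>]*>(.*?)</\1>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            // Strip nested tags such as <a> or <span>, then decode entities like &amp;
            string text = Regex.Replace(m.Groups[2].Value, "<[^>]+>", "");
            headlines.Add(WebUtility.HtmlDecode(text).Trim());
        }
        return headlines;
    }
}
```

The input can be the raw HTML returned by WebClient.DownloadString. Patterns like this one are tied to a particular page layout, which is why each content provider tends to need its own extraction code.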