
    KB: Simple Website Scraping: Extract All Text Content from a Web Page

    THURSDAY, APRIL 8, 2010

    Web scraping, or web data extraction, is a software technique for extracting information from web pages. There are different types of web scraping; one of the most common is the crawling done by search engines, which extract the text content of web pages and index it for searching. This article explains a simple method of extracting text content using C# and regular expressions.

    The System.Net.WebClient class is used in .NET to access data from external web sources. It can download HTML content or other files from a web server much as a browser would, and it supports authentication if the server requires it. You can also add custom HTTP headers to your requests, for example to send a custom User-Agent for your scraping application.
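
As a minimal sketch of that configuration (the class name ScraperClientFactory and its parameters are hypothetical, introduced only for illustration), a WebClient can be given a custom User-Agent header and optional credentials before any request is made:

```csharp
using System;
using System.Net;

public static class ScraperClientFactory
{
    // Hypothetical helper: builds a WebClient with a custom User-Agent
    // and, when a user name is supplied, credentials for authentication.
    public static WebClient Create(string userAgent, string userName = null, string password = null)
    {
        WebClient wc = new WebClient();
        // "User-Agent" is the HTTP request header name sent to the server.
        wc.Headers.Add("User-Agent", userAgent);
        if (userName != null)
        {
            // Used only if the server asks for authentication.
            wc.Credentials = new NetworkCredential(userName, password);
        }
        return wc;
    }
}
```

The caller owns the returned client and should dispose it, for example with a using block around the download call.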

    Depending on the purpose of your application, you can develop a web application, a desktop utility, or a server program. This example shows a C# method that extracts the text content from a given URL. You can use this method in any type of application, but make sure it has the security rights needed to access the target internet resource.

    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    // Place this method inside a class of your application.
    public string scrapeWebsite(string url)
    {
        string extractedContent = "";
        // Singleline lets '.' match line breaks; IgnoreCase handles <SCRIPT>, <Style>, etc.
        const RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        try
        {
            // WebClient is IDisposable, so release it once the download completes
            using (WebClient wc = new WebClient())
            {
                // The request header is "User-Agent"; HTTP_USER_AGENT is the server-side CGI variable name
                wc.Headers.Add("User-Agent", "Web-Scraper-Agent (your-custom-user-agent-here)");
                // Download the web page content from the URL
                extractedContent = wc.DownloadString(url);
            }

            // Remove CSS style blocks, if any are found
            extractedContent = Regex.Replace(extractedContent, @"<style\b[^>]*>.*?</style>", "", options);
            // Remove script blocks
            extractedContent = Regex.Replace(extractedContent, @"<script\b[^>]*>.*?</script>", "", options);
            // Remove all images
            extractedContent = Regex.Replace(extractedContent, @"<img\b[^>]*>", "", options);
            // Remove all remaining HTML tags, leaving only the text inside
            extractedContent = Regex.Replace(extractedContent, @"<[^>]+>", "", options);
            // Remove tabs, extra spaces and repeated line breaks
            extractedContent = Regex.Replace(extractedContent, @"\t", "");
            extractedContent = Regex.Replace(extractedContent, @" {2,}", " ");
            extractedContent = Regex.Replace(extractedContent, @"(\r?\n)+", " ");
        }
        catch (Exception e)
        {
            extractedContent = "Error on downloading: " + e.Message;
        }
        return extractedContent;
    }
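
To try the cleanup steps without a network call, the same idea can be applied to an HTML string held in memory. The sketch below is an illustration rather than part of the method above: the class name HtmlTextExtractor and the method StripHtml are hypothetical, tags are replaced with a space so adjacent words do not run together, and it adds one step the method above does not perform, decoding HTML entities with System.Net.WebUtility.HtmlDecode (available from .NET Framework 4).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextExtractor
{
    // Singleline lets '.' span line breaks; IgnoreCase covers <SCRIPT>, <Style>, etc.
    private const RegexOptions Opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Hypothetical helper: reduces an HTML string to its visible text.
    public static string StripHtml(string html)
    {
        string text = html;
        // Drop style and script blocks together with their contents.
        text = Regex.Replace(text, @"<style\b[^>]*>.*?</style>", " ", Opts);
        text = Regex.Replace(text, @"<script\b[^>]*>.*?</script>", " ", Opts);
        // Replace every remaining tag with a space.
        text = Regex.Replace(text, @"<[^>]+>", " ", Opts);
        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of spaces, tabs and line breaks into one space.
        text = Regex.Replace(text, @"\s+", " ");
        return text.Trim();
    }
}
```

For example, HtmlTextExtractor.StripHtml("<h1>Title</h1><p>Fish &amp; chips</p>") returns "Title Fish & chips".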

    Higher-level applications of web scraping extract specific information from web pages, such as headlines and paragraphs, images, product details and prices, or exchange rates, rather than the content as a whole. These applications require higher-level parsing and usually need code written specifically for each content provider.
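
As one illustration of such targeted extraction, the sketch below pulls only the headline text (h1 to h3 elements) out of an HTML string. The names HeadlineExtractor and ExtractHeadlines are hypothetical, and the regular expression assumes simple, well-formed markup; pages with irregular HTML would likely need a proper HTML parser and provider-specific rules.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeadlineExtractor
{
    // Matches <h1> to <h3> elements; the backreference \1 requires the
    // closing tag to have the same level as the opening tag.
    private static readonly Regex Heading = new Regex(
        @"<h([1-3])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the text of each headline in document order.
    public static List<string> ExtractHeadlines(string html)
    {
        var headlines = new List<string>();
        foreach (Match m in Heading.Matches(html))
        {
            // Drop nested tags such as <a> or <span>, then decode entities.
            string inner = Regex.Replace(m.Groups[2].Value, @"<[^>]+>", "");
            string text = WebUtility.HtmlDecode(inner).Trim();
            if (text.Length > 0)
            {
                headlines.Add(text);
            }
        }
        return headlines;
    }
}
```

The same pattern-per-target approach could be adapted for prices or exchange rates, with the pattern adjusted to each provider's markup.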
