Writing an Email Harvesting Tool with .NET Components
Collecting information from web pages raises two problems. The first is writing a spider that can traverse pages by following links; the ChilkatDotNet component already gives us good support for this. The second is extracting the information we need from each page. There are many ways to do that; here I chose regular expressions.

First, a screenshot of the program running:

[Screenshot of the running program]

The interface is simple: three TextBoxes, one RichTextBox and two Buttons. The three TextBoxes take the site address, the starting URL and the number of links to traverse. The RichTextBox holds the information collected from the pages; here I store each page's URL and the email addresses found on it.

The program has two main parts. The first traverses the site:

    Chilkat.Spider spider = new Chilkat.Spider();

    // Read the starting conditions from the form.
    string website = this.textWebsite.Text;
    string url = this.textUrl.Text;
    int links = Int32.Parse(this.textLinks.Text);

    // The spider object crawls a single web site at a time. As you'll see
    // in later examples, you can collect outbound links and use them to
    // crawl the web. For now, we'll simply spider the requested number of
    // pages of the given site.
    spider.Initialize(website);

    // Add the 1st URL:
    spider.AddUnspidered(url);

    // Begin crawling the site by calling CrawlNext repeatedly.
    for (int i = 0; i < links; i++) {
        bool success = spider.CrawlNext();
        if (success) {
            // Show the URL just crawled, then process that page.
            Invoke(new AppendTextDelegate(AppendText), new object[] { spider.LastUrl + "\r\n" });
            GetAllURL(spider.LastUrl.ToString());
        }
        else {
            // Did we get an error or are there no more URLs to crawl?
            if (spider.NumUnspidered == 0) {
                MessageBox.Show("No more URLs to spider");
                break; // nothing left to crawl, so stop looping
            }
            else {
                MessageBox.Show(spider.LastErrorText);
            }
        }

        // Sleep 1 second before spidering the next URL.
        spider.SleepMs(1000);
    }

This is much like the sample code that comes with ChilkatDotNet; the only addition is the code that reads the initial conditions from the text boxes. Once a URL has been obtained, the page content has to be fetched, and the email addresses are then extracted from it with a regular expression.

Fetching the page content:

    // Request the page with a plain HTTP GET.
    HttpWebRequest webRequest1 = (HttpWebRequest)WebRequest.Create(new Uri(URlStr));
    webRequest1.Method = "GET";
    HttpWebResponse response = (HttpWebResponse)webRequest1.GetResponse();

    // Read the whole response body as text, decoded with the system default encoding.
    Stream stream = response.GetResponseStream();
    StreamReader streamReader = new StreamReader(stream, Encoding.Default);
    String textData = streamReader.ReadToEnd();
    streamReader.Close();
    response.Close();

The regular expression for email addresses:

    @"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)"

This article comes from the 51CTO.COM technical blog.
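The post gives the email pattern but not the code that applies it to the downloaded text. The following is a minimal sketch of that step, not the author's original code. The class name `EmailExtractor`, the method `ExtractEmails` and the sample HTML are hypothetical names introduced for illustration; only the pattern itself comes from the article. Because the pattern's character classes list uppercase letters only, the sketch assumes it is meant to be used with `RegexOptions.IgnoreCase`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailExtractor
{
    // Pattern taken from the article. IgnoreCase is an assumption: without it
    // the [A-Z] classes would only match uppercase addresses.
    private static readonly Regex EmailPattern = new Regex(
        @"(?<EmailStr>\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns each distinct address found in the text,
    // in order of first appearance, comparing addresses case-insensitively.
    public static List<string> ExtractEmails(string textData)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (Match m in EmailPattern.Matches(textData))
        {
            // Read the address through the named group defined in the pattern.
            string email = m.Groups["EmailStr"].Value;
            if (seen.Add(email))
            {
                result.Add(email);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample page content standing in for textData.
        string html = "<p>Contact <a href=\"mailto:Admin@Example.com\">admin@example.com</a>"
                    + " or sales@example.org</p>";
        foreach (string email in ExtractEmails(html))
        {
            Console.WriteLine(email);
        }
    }
}
```

In the crawler, the `textData` string read from the response could be passed to a helper like this, and each result appended to the RichTextBox next to the page URL. Note that the `{2,4}` quantifier limits the top-level domain to four letters, so addresses under longer domains are not matched in full by this pattern.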


