C#/.NET: Example of Scraping the Baidu Baijia Article List with Regular Expressions

Date: 2017-08-25 Editor: 猪哥 Source: 一聚教程网

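In my spare time I have been learning regular expressions. Since practice is the sole criterion for testing truth, I wrote an example that uses a regular expression to scrape articles from Baidu Baijia. The source code below walks through the process: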

1. Fetching the Baidu Baijia page content

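// Requires: using System; using System.Collections.Generic; using System.IO; using System.Net;
public List<string[]> GetUrl()
{
  try
  {
    string url = "http://baijia.baidu.com/";
    WebRequest webRequest = WebRequest.Create(url);
    WebResponse webResponse = webRequest.GetResponse();
    StreamReader reader = new StreamReader(webResponse.GetResponseStream());
    // Read the whole response body as a single string
    string result = reader.ReadToEnd();
    reader.Close();
    webResponse.Close();
    return AnalysisHtml(result);
  }
  catch (Exception)
  {
    // Rethrow with "throw;" so the original stack trace is preserved
    throw;
  }
}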
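The fetch step above closes the reader and the response by hand, so neither is closed if `ReadToEnd` throws. A minimal sketch of the same idea with `using` blocks follows. The names `PageFetcher`, `ReadAll`, and `FetchHtml` are hypothetical and not part of the original code, and UTF-8 is an assumption about the page encoding. On newer .NET versions `WebRequest` is marked obsolete in favor of `HttpClient`, so treat this as an illustration rather than a recommended implementation.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper class, introduced only for this sketch.
public static class PageFetcher
{
    // Reads an entire stream as text. UTF-8 is assumed here; the real page
    // may declare a different encoding.
    public static string ReadAll(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the HTML at the given URL. The using blocks dispose the
    // response and its stream even when reading throws.
    public static string FetchHtml(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body);
        }
    }
}
```

With this helper, the body of `GetUrl` could shrink to something like `return AnalysisHtml(PageFetcher.FetchHtml("http://baijia.baidu.com/"));`, and the try/catch that only rethrows is no longer needed.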

2. Filtering with a regular expression

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public List<string[]> AnalysisHtml(string htmlContent)
{
  List<string[]> list = new List<string[]>();
  // NOTE: the opening part of this pattern was lost when the page was extracted,
  // so only the surviving tail is shown. The name of the first group was lost
  // as well; "Title" is a placeholder.
  string strPattern = @"(?<Title>[^<]+)</a></h3>.*\s*<p\s*class=""feeds-item-text"">(?<Abstract>[^<]+)<a\s*href=""(?<Url>.*)""\s*target=""_blank""\s*class=""feeds-item-more""\s*mon="".*\s*"">.*\s*</a></p>";
  Regex regex = new Regex(strPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
  if (regex.IsMatch(htmlContent))
  {
    MatchCollection matchCollection = regex.Matches(htmlContent);
    foreach (Match match in matchCollection)
    {
      string[] str = new string[3];
      str[0] = match.Groups[1].Value; // title of the list item
      str[1] = match.Groups[2].Value; // content (abstract) of the item
      str[2] = match.Groups[3].Value; // URL the item links to
      list.Add(str);
    }
  }
  return list;
}
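To show the named-group technique in a form that can be run without the live page, here is a small self-contained sketch. The class `FeedParser`, the simplified pattern, and the sample markup are all assumptions made for illustration; they are not the real Baidu Baijia markup or the author's full pattern. It reads groups by name rather than by index, and uses `[^""]+` for the URL instead of a greedy `.*`, which could otherwise run past the closing quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical parser for a simplified, made-up feed markup.
public static class FeedParser
{
    // Each item is assumed to look like:
    //   <h3><a href="URL" ...>TITLE</a></h3>
    //   <p class="feeds-item-text">ABSTRACT</p>
    private static readonly Regex ItemRegex = new Regex(
        @"<h3><a\s+href=""(?<Url>[^""]+)""[^>]*>(?<Title>[^<]+)</a></h3>\s*" +
        @"<p\s+class=""feeds-item-text"">(?<Abstract>[^<]+)</p>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns one string[3] per item: { title, abstract, url }.
    public static List<string[]> Parse(string html)
    {
        var items = new List<string[]>();
        foreach (Match match in ItemRegex.Matches(html))
        {
            items.Add(new[]
            {
                match.Groups["Title"].Value.Trim(),
                match.Groups["Abstract"].Value.Trim(),
                match.Groups["Url"].Value
            });
        }
        return items;
    }
}
```

Reading groups by name keeps the extraction correct if another capturing group is later added to the pattern, whereas numeric indexes such as `Groups[1]` would shift. `Regex.Matches` returns an empty collection when nothing matches, so a separate `IsMatch` check is not required before the loop.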