Python抓取Discuz!用户名脚本代码

2025-02-23 23:50:44

最近学习Python，于是就用Python写了一个抓取Discuz!用户名的脚本，代码很少但是很搓。思路很简单，就是正则匹配title然后提取用户名写入文本文档。程序以百度站长社区为例(一共有40多万用户)，挂在VPS上就没管了，虽然用了延时但是后来发现一共只抓取了50000多个用户名就被封了。。。
代码如下：

代码如下:

# -*- coding: utf-8 -*-
# Author: 天一
# Blog: http://www.90blog.org
# Version: 1.0
# 功能: Python抓取百度站长平台用户名脚本

import urllib
import urllib2
import re
import time

def BiduSpider():
     pattern = re.compile(r'<title>(.*)的个人资料百度站长社区 </title>')
     uid=1
     thedatas = []
     while uid <400000:
         theUrl = "http://bbs.zhanzhang.baidu.com/home.php?mod=space&uid="+str(uid)
         uid +=1
         theResponse = urllib2.urlopen(theUrl)
         thePage = theResponse.read()
         #正则匹配用户名
         theFindall = re.findall(pattern,thePage)
         #等待0.5秒，以防频繁访问被禁止
         time.sleep(0.5)
         if theFindall :
              #中文编码防止乱码输出
              thedatas = theFindall[0].decode('utf-8').encode('gbk')
              #写入txt文本文档
              f = open('theUid.txt','a')
              f.writelines(thedatas+'\n')
              f.close()

if __name__ == '__main__':
BiduSpider()

最终成果如下：

python抓取网页图片示例(python爬虫)

复制代码代码如下: #-*- encoding: utf-8 -*-'''Created on 2014-4-24 @author: Leon Wong''' import urllib2import urllibimport reimport timeimport osimport uuid #获取二级页面urldef findUrl2(html): re1 = r'http://tuchong.com/\d+/\d+/|http://\w+(?<!photos).tuchong.co
python抓取豆瓣图片并自动保存示例学习

环境Python 2.7.6,BS4,在powershell或命令行均可运行.请确保安装了BS模块复制代码代码如下: # -*- coding:utf8 -*-# 2013.12.36 19:41 wnlo-c209# 抓取dbmei.com的图片. from bs4 import BeautifulSoupimport os, sys, urllib2 # 创建文件夹,昨天刚学会path = os.getcwd() # 获取此脚本所在目录new_path = os.pat
python使用beautifulsoup从爱奇艺网抓取视频播放

复制代码代码如下: import sysimport urllibfrom urllib import requestimport osfrom bs4 import BeautifulSoup class DramaItem: def __init__(self, num, title, url): self.num = num self.title = title self.url = url def __str__(self):
python小技巧之批量抓取美女图片

其中用到urllib2模块和正则表达式模块.下面直接上代码: [/code]#!/usr/bin/env python#-*- coding: utf-8 -*-#通过urllib(2)模块下载网络内容import urllib,urllib2,gevent#引入正则表达式模块,时间模块import re,timefrom gevent import monkey monkey.patch_all() def geturllist(url): url_list=[] print ur
python正则匹配抓取豆瓣电影链接和评论代码分享

复制代码代码如下: import urllib.requestimport reimport time def movie(movieTag): tagUrl=urllib.request.urlopen(url) tagUrl_read = tagUrl.read().decode('utf-8') return tagUrl_read def subject(tagUrl_read): ''' 这里还存在问题: ①这只针对单独的一页进行排序,而没有
python抓取京东商城手机列表url实例代码

复制代码代码如下: #-*- coding: UTF-8 -*-'''Created on 2013-12-5 @author: good-temper''' import urllib2import bs4import time def getPage(urlStr): ''' 获取页面内容 ''' content = urllib2.urlopen(urlStr).read() return content def getNextPag
python抓取网页图片并放到指定文件夹

python抓取网站图片并放到指定文件夹复制代码代码如下: # -*- coding=utf-8 -*-import urllib2import urllibimport socketimport osimport redef Docment(): print u'把文件存在E:\Python\图(请输入数字或字母)' h=raw_input() path=u'E:\Python\图'+str(h) if not os.path.exists(path):
Python实现抓取网页并且解析的实例

本文以实例形式讲述了Python实现抓取网页并解析的功能.主要解析问答与百度的首页.分享给大家供大家参考之用. 主要功能代码如下: #!/usr/bin/python #coding=utf-8 import sys import re import urllib2 from urllib import urlencode from urllib import quote import time maxline = 2000 wenda = re.compile("href=\"htt
python抓取网页中的图片示例

复制代码代码如下: #coding:utf8import reimport urllibdef getHTML(url): page = urllib.urlopen(url) html = page.read() return html def getImg(html,imgType): reg = r'src="(.*?\.+'+imgType+'!slider)" ' imgre = re.compile(reg) imgList = re.
python多线程抓取天涯帖子内容示例

使用re, urllib, threading 多线程抓取天涯帖子内容,设置url为需抓取的天涯帖子的第一页,设置file_name为下载后的文件名复制代码代码如下: #coding:utf-8 import urllibimport reimport threadingimport os, time class Down_Tianya(threading.Thread): """多线程下载""" def __init__(sel
python实现从web抓取文档的方法

本文实例讲述了Python实现从Web的一个URL中抓取文档的方法,分享给大家供大家参考.具体方法分析如下: 实例代码如下: import urllib doc = urllib.urlopen("http://www.python.org").read() print doc#直接打印出网页 def reporthook(*a): print a #将http://www.renren.com网页保存到renre.html中, #每读取一个块调用一字reporthook函数 urll
Python抓取京东图书评论数据

京东图书评论有非常丰富的信息,这里面就包含了购买日期.书名.作者.好评.中评.差评等等.以购买日期为例,使用Python + Mysql的搭配进行实现,程序不大,才100行.相关的解释我都在程序里加注了: from selenium import webdriver from bs4 import BeautifulSoup import re import win32com.client import threading,time import MySQLdb def mydebug():
python抓取网页时字符集转换问题处理方案分享

问题提出: 有时候我们采集网页,处理完毕后将字符串保存到文件或者写入数据库,这时候需要制定字符串的编码,如果采集网页的编码是gb2312,而我们的数据库是utf-8的,这样不做任何处理直接插入数据库可能会乱码(没测试过,不知道数据库会不会自动转码),我们需要手动将gb2312转换成utf-8. 首先我们知道,python里的字符默认是ascii码,英文当然没问题啦,碰到中文的时候立马给跪. 不知道你还记不记得,python里打印中文汉字的时候需要在字符串前面加 u: print u"来搞基吗?&

Python抓取Discuz!用户名脚本代码

相关推荐

随机推荐