Hands-On: Scraping City Rental Listings with Python

Contents
  • 1. Single-threaded crawler
  • 2. Upgrading to a multi-threaded crawler
  • 3. Further optimization with asyncio
  • 4. Saving the data to a MySQL database
    • (1) Creating the table
    • (2) Writing the data into the database
  • 5. Final result (screenshots redacted)

Approach: start with a single-threaded crawler, confirm it scrapes successfully, then optimize it with multiple threads, and finally store the results in a database.

The example below scrapes rental listings for Zhengzhou.

Note: this project is for learning purposes only. To avoid putting too much load on the site, set num in the code to a small value and reduce the number of worker threads.
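
If you want to be extra gentle with the site, you can also cap the number of pages and space the requests out. A minimal sketch (the MAX_PAGES, DELAY_SECONDS, and polite_get names are illustrative and not part of the code below):

import time

MAX_PAGES = 3        # illustrative cap: use it to limit the page loop
DELAY_SECONDS = 1.0  # illustrative pause between consecutive requests

def polite_get(session, url):
    # Fetch a URL, then pause briefly so requests are spaced out
    res = session.get(url)
    time.sleep(DELAY_SECONDS)
    return res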

1. Single-threaded crawler

# Use a requests session instead of plain requests calls
# bs4 handles the HTML parsing
# concurrent.futures will handle the concurrency in the later versions
import requests
# from lxml import etree    # alternative: XPath parsing
from bs4 import BeautifulSoup
from urllib import parse
import re
import time
 
headers = {
    'referer': 'https://zz.zu.fang.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'cookie': 'global_cookie=ffzvt3kztwck05jm6twso2wjw18kl67hqft; city=zz; integratecover=1; __utma=147393320.427795962.1613371106.1613371106.1613371106.1; __utmc=147393320; __utmz=147393320.1613371106.1.1.utmcsr=zz.fang.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmt_t0=1; __utmt_t1=1; __utmt_t2=1; ASP.NET_SessionId=aamzdnhzct4i5mx3ak4cyoyp; Rent_StatLog=23d82b94-13d6-4601-9019-ce0225c092f6; Captcha=61584F355169576F3355317957376E4F6F7552365351342B7574693561766E63785A70522F56557370586E3376585853346651565256574F37694B7074576B2B34536C5747715856516A4D3D; g_sourcepage=zf_fy%5Elb_pc; unique_cookie=U_ffzvt3kztwck05jm6twso2wjw18kl67hqft*6; __utmb=147393320.12.10.1613371106'
}
data={
    'agentbid':''
}
 
session = requests.session()
session.headers = headers
 
# Fetch a page
def getHtml(url):
    try:
        res = session.get(url)
        res.encoding = res.apparent_encoding
        return res.text
    except Exception as e:
        print(e)
 
# Get the total number of result pages
def getNum(text):
    soup = BeautifulSoup(text, 'lxml')
    txt = soup.select('.fanye .txt')[0].text
    # pull the digits out of the "共**页" ("N pages in total") label
    num = re.search(r'\d+', txt).group(0)
    return int(num)   # return an int so range() can use it later
 
# Collect the detail-page links from a list page
def getLink(text):
    soup=BeautifulSoup(text,'lxml')
    links=soup.select('.title a')
    for link in links:
        href=parse.urljoin('https://zz.zu.fang.com/',link['href'])
        hrefs.append(href)
 
# Parse a detail page
def parsePage(url):
    res=session.get(url)
    if res.status_code==200:
        res.encoding=res.apparent_encoding
        soup=BeautifulSoup(res.text,'lxml')
        try:
            title=soup.select('div .title')[0].text.strip().replace(' ','')
            price=soup.select('div .trl-item')[0].text.strip()
            block=soup.select('.rcont #agantzfxq_C02_08')[0].text.strip()
            building=soup.select('.rcont #agantzfxq_C02_07')[0].text.strip()
            try:
                address=soup.select('.trl-item2 .rcont')[2].text.strip()
            except:
                address=soup.select('.trl-item2 .rcont')[1].text.strip()
            detail1=soup.select('.clearfix')[4].text.strip().replace('\n\n\n',',').replace('\n','')
            detail2=soup.select('.clearfix')[5].text.strip().replace('\n\n\n',',').replace('\n','')
            detail=detail1+detail2
            name=soup.select('.zf_jjname')[0].text.strip()
            buserid=re.search(r"buserid: '(\d+)'",res.text).group(1)
            phone=getPhone(buserid)
            print(title,price,block,building,address,detail,name,phone)
            house = (title, price, block, building, address, detail, name, phone)
            info.append(house)
        except Exception:
            pass    # skip listings whose page layout does not match the selectors above
    else:
        print(res.status_code,res.text)
 
# Get the agent's virtual phone number
def getPhone(buserid):
    url='https://zz.zu.fang.com/RentDetails/Ajax/GetAgentVirtualMobile.aspx'
    data['agentbid']=buserid
    res=session.post(url,data=data)
    if res.status_code==200:
        return res.text
    else:
        print(res.status_code)
        return
 
if __name__ == '__main__':
    start_time=time.time()
    hrefs=[]
    info=[]
    init_url = 'https://zz.zu.fang.com/house/'
    num=getNum(getHtml(init_url))
    for i in range(0,num):
        url = f'https://zz.zu.fang.com/house/i3{i+1}/'
        text=getHtml(url)
        getLink(text)
    print(hrefs)
    for href in hrefs:
        parsePage(href)
 
    print("共获取%d条数据"%len(info))
    print("共耗时{}".format(time.time()-start_time))
    session.close()

2. Upgrading to a multi-threaded crawler

# Use a requests session instead of plain requests calls
# bs4 handles the HTML parsing
# concurrent.futures handles the concurrency
import requests
# from lxml import etree    # alternative: XPath parsing
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from urllib import parse
import re
import time
 
headers = {
    'referer': 'https://zz.zu.fang.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'cookie': 'global_cookie=ffzvt3kztwck05jm6twso2wjw18kl67hqft; integratecover=1; city=zz; keyWord_recenthousezz=%5b%7b%22name%22%3a%22%e6%96%b0%e5%af%86%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a014868%2f%22%2c%22sort%22%3a1%7d%2c%7b%22name%22%3a%22%e4%ba%8c%e4%b8%83%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a014864%2f%22%2c%22sort%22%3a1%7d%2c%7b%22name%22%3a%22%e9%83%91%e4%b8%9c%e6%96%b0%e5%8c%ba%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a0842%2f%22%2c%22sort%22%3a1%7d%5d; __utma=147393320.427795962.1613371106.1613558547.1613575774.5; __utmc=147393320; __utmz=147393320.1613575774.5.4.utmcsr=zz.fang.com|utmccn=(referral)|utmcmd=referral|utmcct=/; ASP.NET_SessionId=vhrhxr1tdatcc1xyoxwybuwv; g_sourcepage=zf_fy%5Elb_pc; Captcha=4937566532507336644D6557347143746B5A6A6B4A7A48445A422F2F6A51746C67516F31357446573052634562725162316152533247514250736F72775566574A2B33514357304B6976343D; __utmt_t0=1; __utmt_t1=1; __utmt_t2=1; __utmb=147393320.9.10.1613575774; unique_cookie=U_0l0d1ilf1t0ci2rozai9qi24k1pkl9lcmrs*4'
}
data={
    'agentbid':''
}
 
session = requests.session()
session.headers = headers
 
# Fetch a page
def getHtml(url):
    res = session.get(url)
    if res.status_code==200:
        res.encoding = res.apparent_encoding
        return res.text
    else:
        print(res.status_code)
 
# Get the total number of result pages
def getNum(text):
    soup = BeautifulSoup(text, 'lxml')
    txt = soup.select('.fanye .txt')[0].text
    # pull the digits out of the "共**页" ("N pages in total") label
    num = re.search(r'\d+', txt).group(0)
    return int(num)   # return an int so range() can use it later
 
# Collect the detail-page links from one list page
def getLink(url):
    text=getHtml(url)
    soup=BeautifulSoup(text,'lxml')
    links=soup.select('.title a')
    for link in links:
        href=parse.urljoin('https://zz.zu.fang.com/',link['href'])
        hrefs.append(href)
 
# Parse a detail page
def parsePage(url):
    res=session.get(url)
    if res.status_code==200:
        res.encoding=res.apparent_encoding
        soup=BeautifulSoup(res.text,'lxml')
        try:
            title=soup.select('div .title')[0].text.strip().replace(' ','')
            price=soup.select('div .trl-item')[0].text.strip()
            block=soup.select('.rcont #agantzfxq_C02_08')[0].text.strip()
            building=soup.select('.rcont #agantzfxq_C02_07')[0].text.strip()
            try:
                address=soup.select('.trl-item2 .rcont')[2].text.strip()
            except:
                address=soup.select('.trl-item2 .rcont')[1].text.strip()
            detail1=soup.select('.clearfix')[4].text.strip().replace('\n\n\n',',').replace('\n','')
            detail2=soup.select('.clearfix')[5].text.strip().replace('\n\n\n',',').replace('\n','')
            detail=detail1+detail2
            name=soup.select('.zf_jjname')[0].text.strip()
            buserid=re.search(r"buserid: '(\d+)'",res.text).group(1)
            phone=getPhone(buserid)
            print(title,price,block,building,address,detail,name,phone)
            house = (title, price, block, building, address, detail, name, phone)
            info.append(house)
        except Exception:
            pass    # skip listings whose page layout does not match the selectors above
    else:
        print(res.status_code,res.text)
 
# Get the agent's virtual phone number
def getPhone(buserid):
    url='https://zz.zu.fang.com/RentDetails/Ajax/GetAgentVirtualMobile.aspx'
    data['agentbid']=buserid
    res=session.post(url,data=data)
    if res.status_code==200:
        return res.text
    else:
        print(res.status_code)
        return
 
if __name__ == '__main__':
    start_time=time.time()
    hrefs=[]
    info=[]
    init_url = 'https://zz.zu.fang.com/house/'
    num=getNum(getHtml(init_url))
    with ThreadPoolExecutor(max_workers=5) as t:
        for i in range(0,num):
            url = f'https://zz.zu.fang.com/house/i3{i+1}/'
            t.submit(getLink,url)
    print("共获取%d个链接"%len(hrefs))
    print(hrefs)
    with ThreadPoolExecutor(max_workers=30) as t:
        for href in hrefs:
            t.submit(parsePage,href)
    print("共获取%d条数据"%len(info))
    print("耗时{}".format(time.time()-start_time))
    session.close()
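
Note that the worker threads all append to the shared hrefs and info lists, which relies on list.append being effectively atomic in CPython. An alternative is to have each task return its results and collect them in the main thread with as_completed. A minimal sketch (getPageLinks is a hypothetical variant of getLink that returns a list instead of appending to hrefs):

from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_links(page_urls, workers=5):
    # Submit one task per list page and gather the returned links as the tasks finish
    links = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(getPageLinks, url) for url in page_urls]
        for future in as_completed(futures):
            links.extend(future.result())
    return links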

3. Further optimization with asyncio

# Use a requests session instead of plain requests calls
# bs4 handles the HTML parsing
# concurrent.futures handles the concurrency
import requests
# from lxml import etree    # alternative: XPath parsing
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from urllib import parse
import re
import time
import asyncio
 
headers = {
    'referer': 'https://zz.zu.fang.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'cookie': 'global_cookie=ffzvt3kztwck05jm6twso2wjw18kl67hqft; integratecover=1; city=zz; keyWord_recenthousezz=%5b%7b%22name%22%3a%22%e6%96%b0%e5%af%86%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a014868%2f%22%2c%22sort%22%3a1%7d%2c%7b%22name%22%3a%22%e4%ba%8c%e4%b8%83%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a014864%2f%22%2c%22sort%22%3a1%7d%2c%7b%22name%22%3a%22%e9%83%91%e4%b8%9c%e6%96%b0%e5%8c%ba%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a0842%2f%22%2c%22sort%22%3a1%7d%5d; __utma=147393320.427795962.1613371106.1613558547.1613575774.5; __utmc=147393320; __utmz=147393320.1613575774.5.4.utmcsr=zz.fang.com|utmccn=(referral)|utmcmd=referral|utmcct=/; ASP.NET_SessionId=vhrhxr1tdatcc1xyoxwybuwv; g_sourcepage=zf_fy%5Elb_pc; Captcha=4937566532507336644D6557347143746B5A6A6B4A7A48445A422F2F6A51746C67516F31357446573052634562725162316152533247514250736F72775566574A2B33514357304B6976343D; __utmt_t0=1; __utmt_t1=1; __utmt_t2=1; __utmb=147393320.9.10.1613575774; unique_cookie=U_0l0d1ilf1t0ci2rozai9qi24k1pkl9lcmrs*4'
}
data={
    'agentbid':''
}
 
session = requests.session()
session.headers = headers
 
# Fetch a page
def getHtml(url):
    res = session.get(url)
    if res.status_code==200:
        res.encoding = res.apparent_encoding
        return res.text
    else:
        print(res.status_code)
 
# Get the total number of result pages
def getNum(text):
    soup = BeautifulSoup(text, 'lxml')
    txt = soup.select('.fanye .txt')[0].text
    # pull the digits out of the "共**页" ("N pages in total") label
    num = re.search(r'\d+', txt).group(0)
    return int(num)   # return an int so range() can use it later
 
# Collect the detail-page links from one list page
def getLink(url):
    text=getHtml(url)
    soup=BeautifulSoup(text,'lxml')
    links=soup.select('.title a')
    for link in links:
        href=parse.urljoin('https://zz.zu.fang.com/',link['href'])
        hrefs.append(href)
 
# Parse a detail page
def parsePage(url):
    res=session.get(url)
    if res.status_code==200:
        res.encoding=res.apparent_encoding
        soup=BeautifulSoup(res.text,'lxml')
        try:
            title=soup.select('div .title')[0].text.strip().replace(' ','')
            price=soup.select('div .trl-item')[0].text.strip()
            block=soup.select('.rcont #agantzfxq_C02_08')[0].text.strip()
            building=soup.select('.rcont #agantzfxq_C02_07')[0].text.strip()
            try:
                address=soup.select('.trl-item2 .rcont')[2].text.strip()
            except:
                address=soup.select('.trl-item2 .rcont')[1].text.strip()
            detail1=soup.select('.clearfix')[4].text.strip().replace('\n\n\n',',').replace('\n','')
            detail2=soup.select('.clearfix')[5].text.strip().replace('\n\n\n',',').replace('\n','')
            detail=detail1+detail2
            name=soup.select('.zf_jjname')[0].text.strip()
            buserid=re.search(r"buserid: '(\d+)'",res.text).group(1)
            phone=getPhone(buserid)
            print(title,price,block,building,address,detail,name,phone)
            house = (title, price, block, building, address, detail, name, phone)
            info.append(house)
        except Exception:
            pass    # skip listings whose page layout does not match the selectors above
    else:
        print(res.status_code,res.text)
 
# Get the agent's virtual phone number
def getPhone(buserid):
    url='https://zz.zu.fang.com/RentDetails/Ajax/GetAgentVirtualMobile.aspx'
    data['agentbid']=buserid
    res=session.post(url,data=data)
    if res.status_code==200:
        return res.text
    else:
        print(res.status_code)
        return
 
# Thread pool that collects the detail-page links, driven from a coroutine
async def Pool1(num):
    loop=asyncio.get_event_loop()
    task=[]
    with ThreadPoolExecutor(max_workers=5) as t:
        for i in range(0,num):
            url = f'https://zz.zu.fang.com/house/i3{i+1}/'
            task.append(loop.run_in_executor(t,getLink,url))
    await asyncio.wait(task)  # wait on the futures so results and exceptions are retrieved
 
# Thread pool that parses the detail pages, driven from a coroutine
async def Pool2(hrefs):
    loop=asyncio.get_event_loop()
    task=[]
    with ThreadPoolExecutor(max_workers=30) as t:
        for href in hrefs:
            task.append(loop.run_in_executor(t,parsePage,href))
    await asyncio.wait(task)  # wait on the futures so results and exceptions are retrieved
 
if __name__ == '__main__':
    start_time=time.time()
    hrefs=[]
    info=[]
    task=[]
    init_url = 'https://zz.zu.fang.com/house/'
    num=getNum(getHtml(init_url))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(Pool1(num))
    print("共获取%d个链接"%len(hrefs))
    print(hrefs)
    loop.run_until_complete(Pool2(hrefs))
    loop.close()
    print("共获取%d条数据"%len(info))
    print("耗时{}".format(time.time()-start_time))
    session.close()
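
Pool1 and Pool2 still do the real fetching in thread pools; asyncio only drives them from an event loop. On Python 3.9+ the same idea can be written more compactly with asyncio.to_thread and gather. A rough equivalent that reuses the getLink above (this sketch is not part of the original code):

import asyncio

async def crawl_links(num):
    # Run the blocking getLink calls in worker threads and wait for all list pages
    tasks = [asyncio.to_thread(getLink, f'https://zz.zu.fang.com/house/i3{i+1}/')
             for i in range(num)]
    await asyncio.gather(*tasks)

# asyncio.run(crawl_links(num))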

4. Saving the data to a MySQL database

(1) Creating the table

from sqlalchemy import create_engine
from sqlalchemy import String, Integer, Column, Text
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm import scoped_session  # avoids thread-safety problems when the crawler uses multiple threads
from sqlalchemy.ext.declarative import declarative_base
 
BASE = declarative_base()  # declarative base class for the ORM models
engine = create_engine(
    "mysql+pymysql://root:root@127.0.0.1:3306/pytest?charset=utf8",
    max_overflow=300,  # extra connections allowed beyond pool_size
    pool_size=100,  # connection pool size
    echo=False,  # suppress SQL debug output
)
 
 
class House(BASE):
    __tablename__ = 'house'
    id = Column(Integer, primary_key=True, autoincrement=True)
    title=Column(String(200))
    price=Column(String(200))
    block=Column(String(200))
    building=Column(String(200))
    address=Column(String(200))
    detail=Column(Text())
    name=Column(String(20))
    phone=Column(String(20))
 
 
BASE.metadata.create_all(engine)
Session = sessionmaker(engine)
sess = scoped_session(Session)
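
A quick, illustrative sanity check of the table and session (not part of the original script) might look like this:

if __name__ == '__main__':
    # Query the table that was just created; print the first stored title, if any
    first = sess.query(House).first()
    print(first.title if first else 'house table is empty')
    sess.remove()  # release the scoped session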

(2) Writing the data into the database

# Use a requests session instead of plain requests calls
# bs4 handles the HTML parsing
# concurrent.futures handles the concurrency
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from urllib import parse
from mysqldb import sess, House
import re
import time
import asyncio
 
headers = {
    'referer': 'https://zz.zu.fang.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'cookie': 'global_cookie=ffzvt3kztwck05jm6twso2wjw18kl67hqft; integratecover=1; city=zz; __utmc=147393320; ASP.NET_SessionId=vhrhxr1tdatcc1xyoxwybuwv; __utma=147393320.427795962.1613371106.1613575774.1613580597.6; __utmz=147393320.1613580597.6.5.utmcsr=zz.fang.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmt_t0=1; __utmt_t1=1; __utmt_t2=1; Rent_StatLog=c158b2a7-4622-45a9-9e69-dcf6f42cf577; keyWord_recenthousezz=%5b%7b%22name%22%3a%22%e4%ba%8c%e4%b8%83%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a014864%2f%22%2c%22sort%22%3a1%7d%2c%7b%22name%22%3a%22%e9%83%91%e4%b8%9c%e6%96%b0%e5%8c%ba%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a0842%2f%22%2c%22sort%22%3a1%7d%2c%7b%22name%22%3a%22%e7%bb%8f%e5%bc%80%22%2c%22detailName%22%3a%22%22%2c%22url%22%3a%22%2fhouse-a014871%2f%22%2c%22sort%22%3a1%7d%5d; g_sourcepage=zf_fy%5Elb_pc; Captcha=6B65716A41454739794D666864397178613772676C75447A4E746C657144775A347A6D42554F446532357649643062344F6976756E563450554E59594B7833712B413579506C4B684958343D; unique_cookie=U_0l0d1ilf1t0ci2rozai9qi24k1pkl9lcmrs*14; __utmb=147393320.21.10.1613580597'
}
data={
    'agentbid':''
}
 
session = requests.session()
session.headers = headers
 
# Fetch a page
def getHtml(url):
    res = session.get(url)
    if res.status_code==200:
        res.encoding = res.apparent_encoding
        return res.text
    else:
        print(res.status_code)
 
# Get the total number of result pages
def getNum(text):
    soup = BeautifulSoup(text, 'lxml')
    txt = soup.select('.fanye .txt')[0].text
    # pull the digits out of the "共**页" ("N pages in total") label
    num = re.search(r'\d+', txt).group(0)
    return int(num)   # return an int so range() can use it later
 
# Collect the detail-page links from one list page
def getLink(url):
    text=getHtml(url)
    soup=BeautifulSoup(text,'lxml')
    links=soup.select('.title a')
    for link in links:
        href=parse.urljoin('https://zz.zu.fang.com/',link['href'])
        hrefs.append(href)
 
# Parse a detail page and save the result
def parsePage(url):
    res=session.get(url)
    if res.status_code==200:
        res.encoding=res.apparent_encoding
        soup=BeautifulSoup(res.text,'lxml')
        try:
            title=soup.select('div .title')[0].text.strip().replace(' ','')
            price=soup.select('div .trl-item')[0].text.strip()
            block=soup.select('.rcont #agantzfxq_C02_08')[0].text.strip()
            building=soup.select('.rcont #agantzfxq_C02_07')[0].text.strip()
            try:
                address=soup.select('.trl-item2 .rcont')[2].text.strip()
            except:
                address=soup.select('.trl-item2 .rcont')[1].text.strip()
            detail1=soup.select('.clearfix')[4].text.strip().replace('\n\n\n',',').replace('\n','')
            detail2=soup.select('.clearfix')[5].text.strip().replace('\n\n\n',',').replace('\n','')
            detail=detail1+detail2
            name=soup.select('.zf_jjname')[0].text.strip()
            buserid=re.search(r"buserid: '(\d+)'",res.text).group(1)
            phone=getPhone(buserid)
            print(title,price,block,building,address,detail,name,phone)
            house = (title, price, block, building, address, detail, name, phone)
            info.append(house)
            try:
                house_data=House(
                    title=title,
                    price=price,
                    block=block,
                    building=building,
                    address=address,
                    detail=detail,
                    name=name,
                    phone=phone
                )
                sess.add(house_data)
                sess.commit()
            except Exception as e:
                print(e)    # print the error message
                sess.rollback()  # roll back the failed insert
        except Exception:
            pass    # skip listings whose page layout does not match the selectors above
    else:
        print(res.status_code,res.text)
 
# Get the agent's virtual phone number
def getPhone(buserid):
    url='https://zz.zu.fang.com/RentDetails/Ajax/GetAgentVirtualMobile.aspx'
    data['agentbid']=buserid
    res=session.post(url,data=data)
    if res.status_code==200:
        return res.text
    else:
        print(res.status_code)
        return
 
# Thread pool that collects the detail-page links, driven from a coroutine
async def Pool1(num):
    loop=asyncio.get_event_loop()
    task=[]
    with ThreadPoolExecutor(max_workers=5) as t:
        for i in range(0,num):
            url = f'https://zz.zu.fang.com/house/i3{i+1}/'
            task.append(loop.run_in_executor(t,getLink,url))
    await asyncio.wait(task)  # wait on the futures so results and exceptions are retrieved
 
# Thread pool that parses the detail pages, driven from a coroutine
async def Pool2(hrefs):
    loop=asyncio.get_event_loop()
    task=[]
    with ThreadPoolExecutor(max_workers=30) as t:
        for href in hrefs:
            task.append(loop.run_in_executor(t,parsePage,href))
    await asyncio.wait(task)  # wait on the futures so results and exceptions are retrieved
 
if __name__ == '__main__':
    start_time=time.time()
    hrefs=[]
    info=[]
    task=[]
    init_url = 'https://zz.zu.fang.com/house/'
    num=getNum(getHtml(init_url))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(Pool1(num))
    print("共获取%d个链接"%len(hrefs))
    print(hrefs)
    loop.run_until_complete(Pool2(hrefs))
    loop.close()
    print("共获取%d条数据"%len(info))
    print("耗时{}".format(time.time()-start_time))
    session.close()
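
Re-running the crawler will insert the same listings again. One illustrative way to avoid duplicates, not part of the original code, is a small helper (here called save_house) that checks for an existing title before committing:

def save_house(house_data):
    # Illustrative only: skip listings whose title has already been stored
    if sess.query(House).filter_by(title=house_data.title).first() is None:
        sess.add(house_data)
        sess.commit()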

5. Final result (screenshots redacted)

That concludes this hands-on walkthrough of scraping city rental listings with Python.
