Converting HTML to Markdown in Python with the html2text library: a detailed guide
If you search for html2text on PyPI, what you find is a different library: Alir3z4/html2text. It is a fork of aaronsw/html2text that extends the original with additional features, and it is the one installed directly by pip, so it is the library this article covers.
First, install it:
pip install html2text
Using html2text from the command line
Once installed, a range of conversions can be performed with the html2text command.
The command is invoked as html2text [(filename|url) [encoding]]. Running html2text -h lists the options the command supports.
Typical usage looks like this:
# Pass a URL
html2text http://eepurl.com/cK06Gn

# Pass a filename, with the encoding set to utf-8
html2text test.html utf-8
Using html2text in a script
Besides calling html2text on the command line, you can also import it as a library in your own scripts.
Take the following HTML snippet as an example:
html_content = """
<span style="font-size:14px"><a href="http://blog.yhat.com/posts/visualize-nba-pipelines.html" rel="external nofollow" target="_blank" style="color: #1173C7;text-decoration: underline;font-weight: bold;">Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data</a></span><br>
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.<br>
"""
Converting the HTML to Markdown-formatted text takes a single line:
import html2text

print(html2text.html2text(html_content))
The output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
You can also set the configuration options mentioned above through an HTML2Text instance:
import html2text

h = html2text.HTML2Text()
print(h.handle(html_content))  # same output as above
Note: the sections below only show the output with the given option enabled; when an option is left at its default, the output (unless stated otherwise) is the same as above.
--ignore-emphasis
With --ignore-emphasis set:
h.ignore_emphasis = True
print(h.handle("<p>hello, this is <em>Ele</em></p>"))
Output:
hello, this is Ele
Without --ignore-emphasis:
h.ignore_emphasis = False  # the default
print(h.handle("<p>hello, this is <em>Ele</em></p>"))
Output:
hello, this is _Ele_
--reference-links
h.inline_links = False
print(h.handle(html_content))
Output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data][16]
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
[16]: http://blog.yhat.com/posts/visualize-nba-pipelines.html
--ignore-links
h.ignore_links = True
print(h.handle(html_content))
Output:
Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
--protect-links
h.protect_links = True
print(h.handle(html_content))
Output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](<http://blog.yhat.com/posts/visualize-nba-pipelines.html>)
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
--ignore-images
h.ignore_images = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))
Output:
This is a img: ending ...
--images-to-alt
h.images_to_alt = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))
Output:
This is a img: hot3 ending ...
--images-with-size
h.images_with_size = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" height=32px width=32px alt="hot3"> ending ...</p>'))
Output:
This is a img: <img src='https://my.oschina.net/img/hot3.png' width='32px'
height='32px' alt='hot3' /> ending ...
--body-width
h.body_width = 0  # disable hard line wrapping (the default is 78 columns)
print(h.handle(html_content))
Output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.
--mark-code
h.mark_code = True
print(h.handle('<pre class="hljs css"><code class="hljs css"> <span class="hljs-selector-tag"><span class="hljs-selector-tag">rpm</span></span> <span class="hljs-selector-tag"><span class="hljs-selector-tag">-Uvh</span></span> <span class="hljs-selector-tag"><span class="hljs-selector-tag">erlang-solutions-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.0-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.noarch</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.rpm</span></span></code></pre>'))
Output:
rpm -Uvh erlang-solutions-1.0-1.noarch.rpm
With this approach you can script your own automated HTML-to-Markdown conversion. The script below is a complete example:
# -*- coding: utf-8 -*-
import re

import requests
from lxml import etree
import html2text


# Fetch the first (latest) issue from the archive list
def get_first_issue(url):
    resp = requests.get(url)
    page = etree.HTML(resp.text)
    issue_list = page.xpath("//ul[@id='archive-list']/div[@class='display_archive']/li/a")
    fst_issue = issue_list[0].attrib
    fst_issue["text"] = issue_list[0].text
    return fst_issue


# Fetch the issue content and convert it to Markdown
def get_issue_md(url):
    resp = requests.get(url)
    page = etree.HTML(resp.text)
    content = page.xpath("//table[@id='templateBody']")[0]  # alternative: '//table[@class="bodyTable"]'
    h = html2text.HTML2Text()
    h.body_width = 0  # disable hard line wrapping
    return h.handle(etree.tostring(content, encoding="unicode"))


subtitle_mapping = {
    '**From Our Sponsor**': '# 来自赞助商',
    '**News**': '# 新闻',
    '**Articles**,** Tutorials and Talks**': '# 文章,教程和讲座',
    '**Books**': '# 书籍',
    '**Interesting Projects, Tools and Libraries**': '# 好玩的项目,工具和库',
    '**Python Jobs of the Week**': '# 本周的Python工作',
    '**New Releases**': '# 最新发布',
    '**Upcoming Events and Webinars**': '# 近期活动和网络研讨会',
}


def clean_issue(content):
    # Strip the "Share Python Weekly" section and everything after it
    content = re.sub(r'\*\*Share Python Weekly.*', '', content, flags=re.IGNORECASE)
    # Replace the English section subtitles with the Chinese headings
    for k, v in subtitle_mapping.items():
        content = content.replace(k, v)
    return content


tpl_str = """原文:[{title}]({url})

---
{content}
"""


def run():
    issue_list_url = "https://us2.campaign-archive.com/home/?u=e2e180baf855ac797ef407fc7&id=9e26887fc5"
    print("Fetching the latest issue...")
    fst = get_first_issue(issue_list_url)
    # fst = {'href': 'http://eepurl.com/dqpDyL', 'title': 'Python Weekly - Issue 341'}
    print("Done. Extracting the issue content and converting it to Markdown...")
    content = get_issue_md(fst['href'])
    print("Cleaning up the issue content...")
    content = clean_issue(content)
    print("Done. Writing", fst['title'], "to a file...")
    title = fst['title'].replace('- ', '').replace(' ', '_')
    with open(title.strip() + '.md', "w", encoding="utf-8") as f:
        f.write(tpl_str.format(title=fst['title'], url=fst['href'], content=content))
    print("All done. Output saved to %s.md" % title)


if __name__ == '__main__':
    run()
This script runs once a week and converts the latest Python Weekly issue to Markdown.
That wraps up this introduction to html2text. If it does not quite meet your needs, or you want to add more features, you can fork the project and modify it yourself.