Extracting Data with Scrapy CSS and XPath

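We can use CSS or XPath expressions in Scrapy to analyze the content of a fetched page and extract the data items we need. CSS and XPath expressions serve the same purpose but have different syntax; XPath is the more powerful of the two.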

1. Fetching a page with the Scrapy shell

We use the Scrapy shell to fetch the page http://quotes.toscrape.com/page/1/ for analysis:

scrapy shell 'http://quotes.toscrape.com/page/1/'

When running the Scrapy shell from the command line, enclose the URL in quotes. On Windows, use double quotes:

scrapy shell "http://quotes.toscrape.com/page/1/"

The output looks like this:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

2. Selecting elements with CSS

In the Scrapy shell, use the css method of the response object to select elements:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The css method returns a list-like object called a SelectorList, which holds one Selector for each matching element.

To extract the text from the title element, we again use the css method of the response object:

>>> response.css('title::text').extract()
['Quotes to Scrape']

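There are two things to note here. The first is that we added ::text to the CSS query, which means we want to select only the text nodes directly inside the <title> element. If we don't specify ::text, we get the full title element, including its tags: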

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

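The other thing is that calling extract() returns a list, because response.css() returns a SelectorList and there may be more than one match. If we only want the first result, we can use extract_first():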

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

Alternatively, you could write:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

However, using .extract_first() avoids an IndexError and returns None when no element matches the selection.

Besides the extract() and extract_first() methods, you can also use the re() method to extract data with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

To find the right CSS selector, you can inspect the page with the developer tools in Chrome or Firefox.

3. Selecting elements with XPath

Besides CSS, Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

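XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood, which is why the Selector objects printed in the shell show an xpath attribute even for CSS queries.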

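Although XPath expressions are perhaps not as popular as CSS selectors, they offer more power because, besides navigating the structure, they can also look at the content. With XPath you can, for example, select the link that contains the text "Next Page". This makes XPath very well suited to scraping tasks, and we encourage you to learn it.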

4. Integrating data extraction into the spider

Now that we have learned some basic ways to extract data, let's integrate them into the spider we created earlier.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

If you run this spider, it will output the extracted data along with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
