1. Scrapy usage details
- Install the Scrapy framework (the -i flag selects the Douban PyPI mirror and is optional)
- pip install scrapy -i https://pypi.douban.com/simple
- Create a Scrapy project and a spider (the generated spider file is sketched after this list)
- Create the project: scrapy startproject project_name
- Enter the spiders folder: cd project_name/project_name/spiders
- Generate a spider for a start URL: scrapy genspider spider_name url
- D:\Practice\Python\Scrapy_20230226\scrapy_carhome\scrapy_carhome\spiders> scrapy genspider car https://car.autohome.com.cn/price/brand-15.html
- Run the spider (a filled-in parse() is sketched after this list)
- Enter the folder containing the spider: cd project_name/project_name/spiders
- scrapy crawl spider_name
- D:\Practice\Python\Scrapy_20230226\scrapy_carhome\scrapy_carhome\spiders> scrapy crawl car
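For reference, the genspider command above creates a module like spiders/car.py with roughly the following content (the exact template text varies slightly between Scrapy versions):

```python
import scrapy


class CarSpider(scrapy.Spider):
    # Name used on the command line: `scrapy crawl car`
    name = "car"
    # Requests to domains outside this list are filtered out
    allowed_domains = ["car.autohome.com.cn"]
    # Initial URL(s) the Engine pulls from the Spider
    start_urls = ["https://car.autohome.com.cn/price/brand-15.html"]

    def parse(self, response):
        # Called with the downloaded Response for each start URL
        pass
```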
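Before `scrapy crawl car` produces anything useful, parse() needs a body. A minimal sketch; the XPath expressions below are hypothetical placeholders and must be checked against the live page:

```python
import scrapy


class CarSpider(scrapy.Spider):
    name = "car"
    allowed_domains = ["car.autohome.com.cn"]
    start_urls = ["https://car.autohome.com.cn/price/brand-15.html"]

    def parse(self, response):
        # Hypothetical selectors -- inspect the real page and adjust
        for series in response.xpath('//div[@class="main-title"]/a'):
            yield {
                "name": series.xpath("./text()").get(),
                "link": series.xpath("./@href").get(),
            }
        # Yielding a new Request sends it back to the Engine/Scheduler
        next_page = response.xpath('//a[@class="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running `scrapy crawl car -o car.json` dumps the yielded items straight to a file without any pipeline.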
2. Scrapy data flow
https://docs.scrapy.org/en/latest/topics/architecture.html
- The Engine gets the initial URL requests from the Spider.
- The Engine has the Scheduler put the URLs to be crawled into a queue.
- The Scheduler returns the next request to the Engine.
- The Engine sends the request to the Downloader.
- The Downloader fetches the resource from the internet and returns the data to the Engine.
- The Engine receives the response from the Downloader and sends it to the Spider for processing.
- The Spider returns the processed data to the Engine (and also sends new follow-up URLs to the Engine).
- The Engine sends the processed data to the Item Pipelines to be stored (a pipeline sketch follows the official steps below).
- Repeat the steps from step 3 onward until there are no new tasks.
The data flow in Scrapy is controlled by the execution engine, and goes like this:
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
- Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
- The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
- The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
- The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl. (Both downloader-middleware hooks are sketched after this list.)
- The process repeats (from step 3) until there are no more requests from the Scheduler.
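The hand-off to the Item Pipelines (step 8 in the summary above) is implemented by classes in the project's pipelines.py. A minimal sketch that writes each item as one JSON line; the class name and output file name are illustrative:

```python
import json


class CarHomePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open("car.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the Engine passes to the pipelines
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```

A pipeline only runs if it is enabled in settings.py, e.g. ITEM_PIPELINES = {"scrapy_carhome.pipelines.CarHomePipeline": 300}; items pass through lower-numbered pipelines first.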
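The process_request() / process_response() hooks named above live in downloader middlewares (middlewares.py). A sketch of a hypothetical middleware that rotates the User-Agent header, to show where each hook sits in the flow:

```python
import random


class RandomUserAgentMiddleware:
    # Hypothetical UA list; replace with real browser strings
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Runs on the Engine -> Downloader leg (step 4)
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None means: keep processing this request normally

    def process_response(self, request, response, spider):
        # Runs on the Downloader -> Engine leg (step 5)
        spider.logger.debug("%s %s", response.status, request.url)
        return response  # must return a Response (or a Request to retry)
```

Like pipelines, it must be registered in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {"scrapy_carhome.middlewares.RandomUserAgentMiddleware": 543}.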