A Look at the Technology Behind Web Crawlers

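Technology really does move fast. My memory was still stuck on PhantomJS, only to find that PhantomJS stopped being maintained long ago. (Because Chrome and other browsers introduced headless modes, the author posted an issue in 2018-05 saying the project would be archived, and it was archived on 2023-05-30. The project's last release was version 2.1.1, published on 2016-01-25.)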

There are currently two ways to drive a headless browser: WebDriver and the DevTools Protocol.

Taking Chrome as an example, see the following resources:

 

So what is the difference between the DevTools Protocol and WebDriver?

See: https://stackoverflow.com/questions/50939116/what-is-the-difference-between-webdriver-and-devtool-protocol

Main difference between WebDriver protocol and DevTools protocol is that WebDriver protocol needs a middle man like browser-driver (eg: chrome-driver) which is a server that sits between the automation script and browser enabling browser control, but in case of DevTools protocol the automation script can directly talk to browser running in debug mode making headless automation pretty straight forward.

And Chrome driver internally uses DevTools protocol to control browser, so if we are using WebDriver protocol it will in turn use Devtools protocol to control browser.

If cross-browser testing is something important for the new testing tool, DevTools protocol may not be suitable now, as there is no standard yet and it is mostly work in progress. Otherwise DevTools protocol will be a great choice as it gives more control like intercepting request header, simulating network etc and makes headless automation way easier.
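To make the "middle man" point concrete, here is a minimal sketch of what the same "navigate to a URL" command looks like on the wire in each protocol. It only builds the messages and sends nothing. The port 9515 (a common chromedriver default), the session id, and the helper names `webdriver_navigate` and `cdp_navigate` are my own assumptions for illustration, not something from the answer quoted above.

```python
import json

# Assumption for illustration: chromedriver is listening on its usual default port.
CHROMEDRIVER_URL = "http://localhost:9515"


def webdriver_navigate(session_id: str, url: str) -> dict:
    """Describe the HTTP request a WebDriver client sends to the browser driver.

    The driver (the middle man) receives this request and then controls the browser.
    """
    return {
        "method": "POST",
        "url": f"{CHROMEDRIVER_URL}/session/{session_id}/url",
        "body": json.dumps({"url": url}),
    }


def cdp_navigate(message_id: int, url: str) -> str:
    """Build the JSON message a DevTools Protocol client writes directly to the
    browser's debugging WebSocket, with no driver process in between."""
    return json.dumps(
        {"id": message_id, "method": "Page.navigate", "params": {"url": url}}
    )


if __name__ == "__main__":
    # "abc123" is a made-up session id; a real one comes from creating a session.
    print(webdriver_navigate("abc123", "https://example.com"))
    print(cdp_navigate(1, "https://example.com"))
```

In the WebDriver case the script talks HTTP to a separate driver process. In the DevTools case the JSON message goes straight to a browser started in debug mode, which is why the answer calls headless automation "pretty straight forward" with that protocol.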

Other notes:

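  • For the actual code that drives a headless browser, I recommend a tool such as Selenium or Playwright. Both provide bindings for multiple languages and support all the mainstream browsers.
  • As for a crawler framework, Scrapy is a good option.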
