Crawling LinkedIn while authenticated with Scrapy

So I've read through Crawling with an authenticated session in Scrapy and I'm stuck. I'm 99% sure my parse code is correct; I just don't believe the login is redirecting and succeeding.

I'm also having issues with check_login_response() in that I'm not sure which page it is checking, though "Sign Out" would make sense.
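For reference, the check the spider relies on can be reduced to a plain string test: after the login form is submitted, Scrapy follows the redirect and hands the final page's HTML to the callback, and (assuming LinkedIn's page layout of the time) only the logged-in homepage carries a "Sign Out" link. A minimal sketch of that heuristic, independent of Scrapy:

```python
# Sketch of the idea behind check_login_response(): a page that shows a
# "Sign Out" link can only be the logged-in homepage, so its presence is a
# cheap success check.  The sample HTML below is made up for illustration.
def is_logged_in(html):
    """Heuristic: the post-login page contains a 'Sign Out' link."""
    return "Sign Out" in html

# The marker is absent from the login page itself, present after success.
print(is_logged_in('<a href="/logout">Sign Out</a>'))  # True
print(is_logged_in('<form id="login">...</form>'))     # False
```

The same test is what `if "Sign Out" in response.body:` performs inside the spider.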

====== UPDATED ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem


class LinkedPySpider(InitSpider):
  name = 'LinkedPy'
  allowed_domains = ['linkedin.com']
  login_page = 'https://www.linkedin.com/uas/login'
  start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

  def init_request(self):
    """This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.login)

  def login(self, response):
    """Generate a login request."""
    return FormRequest.from_response(response,
          formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
          callback=self.check_login_response)

  def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in."""
    if "Sign Out" in response.body:
      self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
      # Now the crawling can begin..
      return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****
    else:
      self.log("\n\n\nFailed, Bad times :(\n\n\n")
      # Something went wrong, we couldn't log in, so nothing happens.

  def parse(self, response):
    self.log("\n\n\nWe got data!\n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id="result-set"]/li')
    items = []
    for site in sites:
      item = LinkedPyItem()
      item['title'] = site.select('h2/a/text()').extract()
      item['link'] = site.select('h2/a/@href').extract()
      items.append(item)
    return items
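The XPath used in parse() can be sanity-checked outside Scrapy against a saved HTML snippet. The sketch below uses the stdlib's ElementTree (which supports this predicate subset) with made-up sample markup; the ol/li/h2/a structure is an assumption about the results page, and the real spider runs the same path through HtmlXPathSelector:

```python
# Check the //ol[@id='result-set']/li selection against sample markup.
# The <ol>/<li>/<h2>/<a> structure here is assumed, not taken from LinkedIn.
import xml.etree.ElementTree as ET

snippet = """
<div>
  <ol id="result-set">
    <li><h2><a href="/company/1">Acme</a></h2></li>
    <li><h2><a href="/company/2">Globex</a></h2></li>
  </ol>
</div>
"""

items = []
root = ET.fromstring(snippet)
for li in root.findall(".//ol[@id='result-set']/li"):
    a = li.find("h2/a")
    items.append({"title": a.text, "link": a.get("href")})

print(items)
# [{'title': 'Acme', 'link': '/company/1'}, {'title': 'Globex', 'link': '/company/2'}]
```

If this prints an empty list against a real saved page, the selector (not the login) is the problem.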

This was solved by adding 'return' in front of self.initialized().

Thanks again!

-Mark
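The fix works because the framework schedules whatever a callback returns: self.initialized() hands back the request chain that starts the real crawl, and without the return that value is silently discarded. A toy illustration of the principle (plain Python, not the Scrapy API):

```python
# Toy model of the InitSpider handshake: the framework schedules whatever a
# callback returns.  Calling initialized() without returning its result means
# the framework receives None and the crawl never starts.
def initialized():
    # stands in for InitSpider.initialized(), which kicks off start_urls
    return "start crawling"

def check_login_response_without_return():
    initialized()           # result thrown away -> framework sees None

def check_login_response_with_return():
    return initialized()    # result handed back -> crawl proceeds

print(check_login_response_without_return())  # None
print(check_login_response_with_return())     # start crawling
```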


class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):


You also shouldn't override the parse function, as I mentioned in my answer: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy

If you don't understand how to define rules for extracting links, please read the documentation carefully:

http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule

http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
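Conceptually, a Rule pairs a link extractor, which pulls the hrefs matching an allow pattern off each page, with a callback to run on the pages they lead to; the docs above cover the real SgmlLinkExtractor API. A stdlib-only sketch of the extraction idea, with a made-up allow pattern:

```python
# Conceptual sketch of what a link extractor does (stdlib only, not the
# Scrapy API): pull every href out of a page and keep the ones matching an
# allow pattern.  A CrawlSpider Rule pairs such an extractor with a callback.
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html, allow):
    """Return hrefs in html whose value matches the allow regex."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if re.search(allow, link)]

page = '<a href="/csearch/results?page=2">next</a> <a href="/legal">legal</a>'
print(extract_links(page, r"/csearch/"))  # ['/csearch/results?page=2']
```

With a Rule in place, the spider follows the matching links itself and the callback only has to parse each result page.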

