How can I use different pipelines for different spiders in a single Scrapy project

Question 1

I have a Scrapy project that contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all of the pipelines I have defined apply to every spider.

Thanks

Question 2

Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a pipeline object so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. For example:

import functools
import logging


def check_spider_pipeline(process_item_method):
    """Run process_item only for spiders that list this pipeline class."""

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = '%%s %s pipeline step' % (self.__class__.__name__,)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            spider.log(msg % 'executing', level=logging.DEBUG)
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            spider.log(msg % 'skipping', level=logging.DEBUG)
            return item

    return wrapper

For this decorator to work correctly, the spider must have a pipeline attribute holding a container of the pipeline classes that you want to use to process the item, for example:

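import scrapy

from myproject import pipelines  # "myproject" is an assumed package name


class MySpider(scrapy.Spider):
    name = 'myspider'

    # pipeline classes that should run for this spider
    pipeline = {
        pipelines.Save,
        pipelines.Validate,
    }

    def parse(self, response):
        # insert scrapy goodness here
        item = {}
        return item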

And then in a pipelines.py file:

class Save(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do saving here
        return item

class Validate(object):

    @check_spider_pipeline
    def process_item(self, item, spider):
        # do validating here
        return item

All pipeline objects must still be listed in ITEM_PIPELINES in the settings, in the correct order. (It would be nice to change this so that the order could be specified on the spider, too.)
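To see what the decorator does without running a crawl, here is a minimal sketch built on stand-in objects. `FakeSpider`, its `log` method, and `run_pipelines` are hypothetical stubs written for this illustration, not Scrapy classes, and the decorator is a condensed copy of the one shown earlier that assumes the standard `logging` module for the log level.

```python
import functools
import logging


def check_spider_pipeline(process_item_method):
    """Run process_item only if the pipeline class is listed on the spider."""

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        name = self.__class__.__name__
        if self.__class__ in spider.pipeline:
            spider.log('executing %s pipeline step' % name, level=logging.DEBUG)
            return process_item_method(self, item, spider)
        spider.log('skipping %s pipeline step' % name, level=logging.DEBUG)
        return item

    return wrapper


class Save(object):
    @check_spider_pipeline
    def process_item(self, item, spider):
        item['saved'] = True
        return item


class Validate(object):
    @check_spider_pipeline
    def process_item(self, item, spider):
        item['validated'] = True
        return item


class FakeSpider(object):
    """Hypothetical stand-in exposing only what the decorator needs."""

    pipeline = {Validate}  # this spider opts in to Validate only

    def __init__(self):
        self.messages = []

    def log(self, message, level=logging.DEBUG):
        self.messages.append(message)


def run_pipelines(item, spider):
    # Every step is still called in a fixed order, as with ITEM_PIPELINES;
    # the decorator decides whether a step does any work.
    for step in (Validate(), Save()):
        item = step.process_item(item, spider)
    return item


spider = FakeSpider()
result = run_pipelines({'title': 'example'}, spider)
print(result)
print(spider.messages)
```

Because `FakeSpider.pipeline` lists only `Validate`, the `Save` step is called but returns the item unchanged.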

Question 3

Just remove all pipelines from the main settings and declare them inside the spider instead.

This defines the pipelines to use per spider:

class testSpider(InitSpider):
    name = 'test'
    custom_settings = {
        'ITEM_PIPELINES': {
            'app.MyPipeline': 400
        }
    }
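The effect of a per-spider `custom_settings` can be modelled in a few lines. This is only a simplified sketch of the override behaviour, not Scrapy's actual settings machinery: `enabled_pipelines`, `OtherSpider`, and `app.OtherPipeline` are hypothetical names, and the spider classes are plain stand-ins.

```python
def enabled_pipelines(project_settings, spider_cls):
    """Return pipeline paths in run order for one spider (simplified model)."""
    settings = dict(project_settings)
    # A spider's custom_settings take precedence over project-level values.
    settings.update(getattr(spider_cls, 'custom_settings', None) or {})
    pipelines = settings.get('ITEM_PIPELINES', {})
    # Pipelines with lower numbers run first.
    return sorted(pipelines, key=pipelines.get)


# Nothing is enabled project-wide, as the answer suggests.
PROJECT_SETTINGS = {'ITEM_PIPELINES': {}}


class TestSpider(object):  # stand-in for a real spider class
    name = 'test'
    custom_settings = {
        'ITEM_PIPELINES': {'app.MyPipeline': 400},
    }


class OtherSpider(object):  # hypothetical second spider
    name = 'other'
    custom_settings = {
        'ITEM_PIPELINES': {'app.MyPipeline': 400, 'app.OtherPipeline': 300},
    }


print(enabled_pipelines(PROJECT_SETTINGS, TestSpider))
print(enabled_pipelines(PROJECT_SETTINGS, OtherSpider))
```

Each spider ends up with exactly the pipelines it declares, ordered by their numeric value.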

Question 4

The other solutions given here are good, but I think they could be slow, because we are not really using a pipeline per spider. Instead, we check whether a pipeline applies every time an item is returned (and in some cases that can reach millions of items).

A good way to completely disable (or enable) a feature per spider is to use custom_settings and from_crawler, which works for all extensions, like this:

pipelines.py

from scrapy.exceptions import NotConfigured

class SomePipeline(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            # if this isn't specified in settings, the pipeline will be completely disabled
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        # change my item
        return item

settings.py

ITEM_PIPELINES = {
   'myproject.pipelines.SomePipeline': 300,
}
SOMEPIPELINE_ENABLED = True # you could have the pipeline enabled by default

spider1.py

class Spider1(Spider):

    name = 'spider1'

    start_urls = ["http://example.com"]

    custom_settings = {
        'SOMEPIPELINE_ENABLED': False
    }

As you can see, we have specified custom_settings, which override what is specified in settings.py, and we are disabling SOMEPIPELINE_ENABLED for this spider.

Now when you run this spider, look for something like:

[scrapy] INFO: Enabled item pipelines: []

Now Scrapy has completely disabled the pipeline and does not bother with it for the whole run. Note that this also works for Scrapy extensions and middlewares.
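The reason this is cheap is that the check happens once, when the pipeline is built, rather than once per item. The sketch below models that with stand-ins: `NotConfigured` here is a local substitute for `scrapy.exceptions.NotConfigured`, while `FakeSettings`, `FakeCrawler`, and `build_pipelines` are hypothetical helpers that only approximate what Scrapy does at start-up.

```python
class NotConfigured(Exception):
    """Local stand-in for scrapy.exceptions.NotConfigured."""


class FakeSettings(dict):
    """Hypothetical settings object exposing only getbool()."""

    def getbool(self, name, default=False):
        return bool(self.get(name, default))


class FakeCrawler(object):
    """Hypothetical crawler carrying nothing but settings."""

    def __init__(self, settings):
        self.settings = FakeSettings(settings)


class SomePipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('SOMEPIPELINE_ENABLED'):
            raise NotConfigured
        return cls()

    def process_item(self, item, spider):
        return item


def build_pipelines(pipeline_classes, crawler):
    """Instantiate each pipeline once, dropping those that opt out."""
    enabled = []
    for cls in pipeline_classes:
        try:
            enabled.append(cls.from_crawler(crawler))
        except NotConfigured:
            continue  # a dropped pipeline is never consulted again
    return enabled


on = build_pipelines([SomePipeline], FakeCrawler({'SOMEPIPELINE_ENABLED': True}))
off = build_pipelines([SomePipeline], FakeCrawler({'SOMEPIPELINE_ENABLED': False}))
print(len(on), len(off))
```

With the setting switched off, the list of enabled pipelines is empty, so no per-item check is ever made.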

Question 5

I can think of at least four approaches:

1. Use a different Scrapy project per set of spiders + pipelines (might be appropriate if your spiders are different enough to warrant being in different projects).
2. On the scrapy tool command line, change the pipeline setting with scrapy settings between each invocation of your spider.
3. Isolate your spiders into their own scrapy tool commands, and define default_settings['ITEM_PIPELINES'] on your command class as the pipeline list you want for that command. See line 6 of this example.
4. In the pipeline classes themselves, have process_item() check which spider it is running against, and do nothing if it should be ignored for that spider. See the example using resources per spider to get you started. (This seems like an ugly solution because it tightly couples spiders and item pipelines. You probably shouldn't use this one.)

Question 6

You can use the spider's name attribute in your pipeline:

class CustomPipeline(object):

    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item

Defining all your pipelines this way lets you achieve what you want.
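If several spiders share a pipeline, the name check can be kept in one place. The following is a small sketch of that idea, not part of the original answer: the `allowed_spiders` attribute and the `processed_by` field are hypothetical, and `FakeSpider` is a stand-in that carries only a name.

```python
class CustomPipeline(object):
    # Hypothetical attribute: names of the spiders this pipeline applies to.
    allowed_spiders = {'spider1'}

    def process_item(self, item, spider):
        if spider.name not in self.allowed_spiders:
            return item  # pass the item through untouched
        # do something; here we only mark the item for illustration
        item['processed_by'] = self.__class__.__name__
        return item


class FakeSpider(object):
    """Stand-in spider that carries only a name."""

    def __init__(self, name):
        self.name = name


first = CustomPipeline().process_item({}, FakeSpider('spider1'))
second = CustomPipeline().process_item({}, FakeSpider('spider2'))
print(first, second)
```

Note that the check still runs for every item, which is the cost the from_crawler approach avoids.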

Question 7

You can simply set the item pipeline settings inside the spider, like this:

class CustomSpider(Spider):
    name = 'custom_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.PagePipeline': 400,
            '__main__.ProductPipeline': 300,
        },
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2
    }

I can then split up a pipeline (or even use multiple pipelines) by adding a value to the loader/returned item that identifies which part of the spider sent the item. This way I won't get any KeyError exceptions, and I know which fields should be available.

    ...
    def scrape_stuff(self, response):
        pageloader = PageLoader(
                PageItem(), response=response)

        pageloader.add_xpath('entire_page', '/html//text()')
        pageloader.add_value('item_type', 'page')
        yield pageloader.load_item()

        productloader = ProductLoader(
                ProductItem(), response=response)

        productloader.add_xpath('product_name', '//span[contains(text(), "Example")]')
        productloader.add_value('item_type', 'product')
        yield productloader.load_item()

class PagePipeline:
    def process_item(self, item, spider):
        if item['item_type'] == 'product':
            # do product stuff
            pass

        if item['item_type'] == 'page':
            # do page stuff
            pass

        # always hand the item on to the next pipeline
        return item
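As the number of item types grows, the chain of `if` checks can be replaced by a lookup. This is one possible sketch, assuming items behave like dicts; the `handle_page` and `handle_product` methods and the `handled_as` field are hypothetical names used only for illustration.

```python
class PagePipeline(object):
    def process_item(self, item, spider):
        # Look up a handler named after the item's item_type value;
        # items with an unknown or missing type pass through unchanged.
        handler = getattr(self, 'handle_' + item.get('item_type', ''), None)
        if handler is not None:
            handler(item)
        return item

    def handle_page(self, item):
        # do page stuff
        item['handled_as'] = 'page'

    def handle_product(self, item):
        # do product stuff
        item['handled_as'] = 'product'


pipeline = PagePipeline()
page = pipeline.process_item({'item_type': 'page'}, spider=None)
product = pipeline.process_item({'item_type': 'product'}, spider=None)
other = pipeline.process_item({'title': 'no type'}, spider=None)
print(page, product, other)
```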

Question 8

A simple but still useful solution.

Spider code

    def parse(self, response):
        item = {}
        # ... do parse stuff
        # tag the item with the spider that produced it
        item['info'] = {'spider': 'Spider2'}
        yield item

Pipeline code

    def process_item(self, item, spider):
        # branch on the tag that the spider attached to the item
        if item['info']['spider'] == 'Spider1':
            logging.error('Spider1 pipeline works')
        elif item['info']['spider'] == 'Spider2':
            logging.error('Spider2 pipeline works')
        elif item['info']['spider'] == 'Spider3':
            logging.error('Spider3 pipeline works')
        return item

Hope this saves someone some time!

Question 9

I am using two pipelines, one for downloading images (MyImagesPipeline) and a second one for saving data in MongoDB (MongoPipeline).

Suppose we have many spiders (spider1, spider2, ...). In my example, spider1 and spider5 must not use MyImagesPipeline.

settings.py

ITEM_PIPELINES = {'scrapycrawler.pipelines.MyImagesPipeline' : 1,'scrapycrawler.pipelines.MongoPipeline' : 2}
IMAGES_STORE = '/var/www/scrapycrawler/dowload'

And below is the complete code of the pipelines:

import scrapy
import pymongo
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    def process_item(self, item, spider):
        # skip image downloading for the spiders that do not need it
        if spider.name not in ['spider1', 'spider5']:
            return super().process_item(item, spider)
        else:
            return item

    def file_path(self, request, response=None, info=None, *, item=None):
        # shard files into folders named after the first two
        # characters of the image file name
        image_name = request.url.split('/')[-1]
        dir1 = image_name[0]
        dir2 = image_name[1]
        return dir1 + '/' + dir2 + '/' + image_name

class MongoPipeline(object):

    collection_name = 'scrapy_items'
    collection_url='snapdeal_urls'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # use the collection named on the item, else the default one
        collection_name = item.get('collection_name', self.collection_name)
        self.db[collection_name].insert_one(dict(item))
        # flag every URL record with the same base_id as downloaded
        self.db[self.collection_url].update_many(
            {'base_id': item['base_id']},
            {'$set': {'image_download': 1}},
            upsert=False,
        )
        return item

Question 10

We can use conditions in the pipeline, like this:

# -*- coding: utf-8 -*-
from scrapy_app.items import x


class SaveItemPipeline(object):
    def process_item(self, item, spider):
        # only items of type x are saved; everything else passes through
        if isinstance(item, x):
            item.save()
        return item
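A self-contained sketch of the same type check follows, with stand-in item classes in place of the project's own. `ArticleItem` and `CommentItem` are hypothetical, and their dict base class and `save()` method are assumptions made only so the example can run by itself.

```python
class ArticleItem(dict):
    """Hypothetical item type that knows how to save itself."""

    def save(self):
        self['saved'] = True


class CommentItem(dict):
    """Hypothetical item type that this pipeline should ignore."""


class SaveItemPipeline(object):
    def process_item(self, item, spider):
        # Act only on the item type this pipeline is meant for.
        if isinstance(item, ArticleItem):
            item.save()
        return item


pipeline = SaveItemPipeline()
article = pipeline.process_item(ArticleItem(title='a'), spider=None)
comment = pipeline.process_item(CommentItem(text='c'), spider=None)
print(article, comment)
```

Filtering on the item class rather than the spider name keeps the pipeline usable by any spider that yields that item type.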

Question 11

The simplest and most effective solution is to set custom settings in each spider.

custom_settings = {'ITEM_PIPELINES': {'project_name.pipelines.SecondPipeline': 300}}

After that, you need to set them in the settings file:

ITEM_PIPELINES = {
   'project_name.pipelines.FirstPipeline': 300,
   'project_name.pipelines.SecondPipeline': 300
}

That way, each spider will use its respective pipeline.