FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison

Question 1

I am using pandas 0.19.1 on Python 3. I am getting a warning on these lines of code. I am trying to get a list containing all the row numbers where the string Peter is present in the column Unnamed: 5.

df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()

It produces a warning:

"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise 
comparison failed; returning scalar, but in the future will perform 
elementwise comparison 
result = getattr(x, name)(y)"

What is this FutureWarning, and should I ignore it since the code seems to work?

Question 2

This FutureWarning is not from pandas, it is from numpy, and the bug also affects matplotlib and others. Here is how to reproduce the warning closer to the source of the trouble:

import numpy as np
print(np.__version__)   # Numpy version '1.12.0'
'x' in np.arange(5)       #Future warning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the 
future will perform elementwise comparison
False

Another way to reproduce this bug, using the double equals operator:

import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here

An example of Matplotlib being affected by this FutureWarning, in its quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html

What's going on here?

There is a disagreement between Numpy and native Python about what should happen when you compare a string to one of numpy's numeric types. In the first example, notice that the left operand is Python's turf (a primitive string) and the operation in the middle is Python's turf, but the right operand is numpy's turf. Should the result be a Python-style scalar or a Numpy-style ndarray of booleans? Numpy says an ndarray of bool; Pythonic developers disagree. A classic standoff.

Should the result be an elementwise comparison, or a single scalar saying whether the item exists in the array?

If your code or a library is using the in or == operators to compare a Python string to numpy ndarrays, the two are not compatible, so when you try it, it returns a scalar, but only for now. The warning indicates that this behavior may change in the future, so your code will puke all over the carpet if Python/numpy decide to adopt the Numpy style.
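One way to make the intended result explicit, whichever way the libraries eventually settle, is to guard the comparison yourself. The sketch below is only an illustration: elementwise_equal is a hypothetical helper (not part of numpy), and it assumes that a string should never count as equal to a number.

```python
import numpy as np


def elementwise_equal(arr, value):
    """Return one bool per element; never fall back to a bare scalar."""
    arr = np.asarray(arr)
    value_is_text = isinstance(value, str)
    array_is_text = arr.dtype.kind == "U"  # "U" is numpy's unicode string kind
    if arr.dtype.kind != "O" and value_is_text != array_is_text:
        # Text vs. non-text (assumption: these can never be equal),
        # so answer False for every element without asking numpy.
        return np.zeros(arr.shape, dtype=bool)
    return arr == value


print(elementwise_equal(np.arange(5), "x"))
print(elementwise_equal(np.array(["a", "x"]), "x"))
print(elementwise_equal(np.arange(3), 1))
```

Object arrays are passed straight through, since numpy already compares those element by element using Python's own ==.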

Bug reports submitted:

Numpy and Python are in a standoff; for now the operation returns a scalar, but this may change in the future.

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

Two workarounds:

Either lock down your versions of Python and numpy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of == and in so that they are both numpy types or both primitive Python numeric types.
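A minimal sketch of the second option, assuming the array really holds integers and the value being searched for arrives as text. The variable names are made up for illustration.

```python
import numpy as np

numbers = np.arange(5)
needle = "3"  # hypothetical search value that arrives as a string

# Option A: bring the array over to the string side.
as_text = numbers.astype(str)
found_as_text = needle in as_text
mask_as_text = as_text == needle

# Option B: bring the string over to the numeric side.
found_as_number = int(needle) in numbers

print(found_as_text, found_as_number)
print(mask_as_text)
```

Either way both operands end up with the same kind of type, so there is nothing left for numpy and Python to disagree about.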

Suppress the warning globally:

import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning

Suppress the warning on a line-by-line basis:

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))   #returns False, warning is suppressed

print('x' in np.arange(10))   #returns False, Throws FutureWarning

Just suppress the warning by name, then put a loud comment next to it that mentions the current versions of Python and numpy, says this code is brittle and requires those versions, and links here. Kick the can down the road.
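A sketch of what suppressing "by name" could look like, assuming the filter should match only this particular message rather than every FutureWarning. The two warnings.warn calls are stand-ins raised by hand so the example does not depend on a particular numpy version.

```python
import warnings


def suppress_elementwise_warning():
    # BRITTLE: record the Python and numpy versions this was written
    # against here, and link to the discussion, as suggested above.
    warnings.filterwarnings(
        "ignore",
        message="elementwise comparison failed",  # regex matched at the start
        category=FutureWarning,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    suppress_elementwise_warning()
    # Stand-in for the numpy warning; this one should be filtered out.
    warnings.warn(
        "elementwise comparison failed; returning scalar instead", FutureWarning
    )
    # An unrelated FutureWarning; this one should still get through.
    warnings.warn("some other upcoming change", FutureWarning)

remaining = [str(w.message) for w in caught]
print(remaining)
```

Matching on the message keeps other FutureWarnings visible, which the blanket simplefilter call does not.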

TLDR: pandas are the Jedi; numpy are the Hutts; and python is the Galactic Empire. https://youtu.be/OZczsiCfQQk?t=3

Question 3

I get the same error when I try to set index_col while reading a file into a pandas data frame:

df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0'])  ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])

I have never encountered such an error before. I am still trying to figure out the reason behind it (using @Eric Leschinski's explanation and others).

Anyhow, the following approach solves the problem for now, until I figure out the reason:

df = pd.read_csv('my_file.tsv', sep='\t', header=0)  ## not setting the index_col
df.set_index(['0'], inplace=True)

As soon as I figure out the reason for this behavior, I will update this.
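A self-contained sketch of the workaround above, using an in-memory tab-separated sample instead of my_file.tsv. The sample rows and the name column are assumptions for illustration; only the column label '0' is taken from the snippet above.

```python
import io

import pandas as pd

# Hypothetical stand-in for my_file.tsv: the header row labels the first column "0".
sample_tsv = "0\tname\n1\tPeter\n2\tJoe\n"

# Read without index_col, then promote the column to the index afterwards.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=0)
df = df.set_index("0")

print(df)
```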

Question 4

In my experience, the same warning message was caused by a TypeError.

TypeError: invalid type comparison

So, you may want to check the data type of Unnamed: 5:

for x in df['Unnamed: 5']:
  print(type(x))  # are they 'str' ?
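For a long column, printing every type gets noisy. A hedged alternative is to tally the Python types instead; the sample frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing strings, a number and a missing value.
df = pd.DataFrame({"Unnamed: 5": ["Peter", 3, np.nan, "Joe"]})

# Count how many values of each Python type the column holds.
type_counts = (
    df["Unnamed: 5"].map(lambda value: type(value).__name__).value_counts()
)
print(type_counts.to_dict())
```

Anything other than a single 'str' entry suggests the column is not purely text, which is the situation in which the comparison against 'Peter' starts to misbehave.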

Here is how I can reproduce the warning message:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4  # No Error

Hope this helps.

Question 5

I can't beat Eric Leschinski's awesomely detailed answer, but here is a quick solution to the original question that I don't think has been mentioned yet: put the string in a list and use .isin instead of ==

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})

# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]

# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]

Question 6

A quick workaround for this is to use numpy.core.defchararray. I also faced the same warning message and was able to resolve it using that module.

import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)
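The same function is also reachable through the public np.char namespace, which avoids importing from numpy.core directly. A small sketch with made-up string arrays standing in for dataset1 and dataset2:

```python
import numpy as np

# Hypothetical stand-ins for dataset1 and dataset2; both must hold strings.
dataset1 = np.array(["Peter", "Joe", "Ann"])
dataset2 = np.array(["Peter", "Bob", "Ann"])

# Elementwise string equality, one bool per position.
resultdataset = np.char.equal(dataset1, dataset2)
print(resultdataset)
```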

Question 7

Eric's answer helpfully explains that the trouble comes from comparing a pandas Series (which contains a NumPy array) to a Python string. Unfortunately, both of his workarounds just suppress the warning.

To write code that does not cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each. For example, you could use map and an anonymous function.

myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()
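A runnable version of the same idea on a small, invented frame with mixed value types, to show that each element is compared with plain Python == and the non-string values simply come out as False:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a number, two matching strings and a missing value.
df = pd.DataFrame({"Unnamed: 5": [1, "Peter", np.nan, "Peter"]})

# One Python-level comparison per element; each returns a plain bool.
is_peter = df["Unnamed: 5"].map(lambda x: x == "Peter")
myRows = df[is_peter].index.tolist()
print(myRows)
```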

Question 8

If your arrays aren't too big or you don't have too many of them, you might be able to get away with forcing the left-hand side of == to be a string:

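# str(df['Unnamed: 5']) would turn the whole Series into one string, so convert each element instead
myRows = df[df['Unnamed: 5'].astype(str) == 'Peter'].index.tolist()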

But the string conversion is ~1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length = 10), and 150-160 times slower if it is a numpy array of length 100 (times averaged over 500 trials).

import time
from numpy import linspace, zeros, mean

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))

print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))

print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))

Result:

Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541

Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288

String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178

Question 9

In my case, the warning was caused by nothing more than ordinary boolean indexing, because the Series contained only np.nan. Demonstration (pandas 1.0.3):

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0    False
1     True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
0    False
1    False

I think that with pandas 1.0 they really want you to use the new 'string' datatype, which allows for pd.NA values:

>>> pd.Series([pd.NA, pd.NA]) == 'Hi'
0    False
1    False
>>> pd.Series([np.nan, np.nan], dtype='string') == 'Hi'
0    <NA>
1    <NA>
>>> (pd.Series([np.nan, np.nan], dtype='string') == 'Hi').fillna(False)
0    False
1    False

I don't love that they tinkered with everyday functionality such as boolean indexing at this point.

Question 10

I got this warning because I thought my column contained null strings, but on checking, it contained np.nan!

if df['column'] == '':

Changing my np.nan values to empty strings helped :)
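A sketch of one way to do that conversion, assuming a column named column as in the snippet above; the sample values are invented. The comparison yields one bool per row, so it is reduced with .any() before being used in an if statement.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value, a real string and an empty string.
df = pd.DataFrame({"column": [np.nan, "a", ""]})

# Replace np.nan with '' first, then compare; the result is one bool per row.
is_empty = df["column"].fillna("") == ""
print(is_empty.tolist())

if is_empty.any():
    print("at least one row is empty or missing")
```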

Question 11

I have compared a few of the possible ways of doing this, including pandas, several numpy methods, and a list comprehension.

First, let's start with a baseline:

>>> import numpy as np
>>> import operator
>>> import pandas as pd

>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints

So, our baseline is that the correct count is 2, and that it should take about 50 µs.

Now, we try the naive method:

>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  """Entry point for launching an IPython kernel.

And here we get the wrong answer (NotImplemented != 2), it takes a long time, and it throws the warning.

So we'll try another naive method:

>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==

Again, the wrong answer (0 != 2). This is even more insidious because there is no warning afterwards (0 can be passed around just like 2).
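The silent 0 appears to come from x still being a plain Python list at this point, so x == 's' is an ordinary list-to-string comparison that evaluates to a single False. A short sketch, assuming the same four-element list, showing that wrapping the list in a numpy array first gives an elementwise result:

```python
import numpy as np

x = ['s', 'b', 's', 'b']

# list vs. str: Python compares the two objects as a whole -> one bool.
list_result = (x == 's')

# string array vs. str: numpy compares each element -> one bool per element.
array_result = (np.array(x) == 's')

print(list_result)
print(array_result)
print(int(np.sum(array_result)))
```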

Now, let's try a list comprehension:

>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension

We get the right answer here, and it's pretty fast!

Another possibility is pandas:

>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==

Slow, but correct!

And finally, the option I'm going to use: casting the numpy array to the object type:

>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal

Fast and correct!

Question 12

I had this code which was causing the error:

for t in dfObj['time']:
  if type(t) == str:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int

I changed it to this:

for t in dfObj['time']:
  try:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
  except Exception as e:
    print(e)
    continue

to avoid the type comparison, which is what was throwing the warning, as stated above. I only had to catch the exception because of the dfObj.loc assignment inside the for loop; maybe there is a way to tell it not to check the rows it has already changed.
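One possible way to avoid revisiting rows that were already changed is to convert each value once with map instead of assigning through dfObj.loc inside the loop. This is only a sketch under assumptions: to_epoch_seconds is a hypothetical helper, and the sample frame (one time-zone-aware timestamp string, one value that is already an integer) is invented.

```python
import dateutil.parser
import pandas as pd


def to_epoch_seconds(value):
    """Parse strings into integer epoch seconds; pass other values through."""
    if isinstance(value, str):
        return int(dateutil.parser.parse(value).timestamp())
    return value


# Hypothetical sample: an unparsed timestamp string and an already-converted int.
dfObj = pd.DataFrame({"time": ["2020-01-01T00:00:00+00:00", 1577836801]})

# Each value is visited exactly once, so nothing is compared against the column.
dfObj["time"] = dfObj["time"].map(to_epoch_seconds)
print(dfObj["time"].tolist())
```

Because no element is ever compared against the whole column, the string-versus-number comparison that triggered the warning does not occur.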