26

The challenge is to write the fastest code possible for computing the permanent of a matrix.

The permanent of an n-by-n matrix A = (a_{i,j}) is defined as

$$\operatorname{perm}(A) = \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}$$

Here S_n represents the set of all permutations of [1, n].

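As an example (from the wiki):

$$\operatorname{perm}\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} = aei + bfg + cdh + ceg + bdi + afh$$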

In this question the matrices are all square and will contain only the values -1 and 1.
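A direct, deliberately slow reading of the definition can be sketched in Python 3. The function name `permanent_bruteforce` is made up for this illustration. It enumerates all n! permutations, so it is only useful for checking small cases such as the 4-by-4 examples in this question, not as a competitive entry.

```python
from itertools import permutations
from math import prod


def permanent_bruteforce(matrix):
    """Permanent taken straight from the definition (hypothetical helper).

    Sums, over every permutation sigma of the column indices, the product
    of the entries matrix[i][sigma[i]]. Runs in O(n * n!) time.
    """
    n = len(matrix)
    return sum(
        prod(matrix[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


# First 4-by-4 example from the question.
example = [
    [1, -1, -1, 1],
    [-1, -1, -1, 1],
    [-1, 1, -1, 1],
    [1, -1, -1, 1],
]
print(permanent_bruteforce(example))  # prints -4
```

Anything beyond roughly n = 10 is already impractical this way, which is why the answers rely on the Ryser and Glynn formulas instead.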

Examples

Input:

[[ 1 -1 -1  1]
 [-1 -1 -1  1]
 [-1  1 -1  1]
 [ 1 -1 -1  1]]

Permanent:

-4

Input:

[[-1 -1 -1 -1]
 [-1  1 -1 -1]
 [ 1 -1 -1 -1]
 [ 1 -1  1 -1]]

Permanent:

0

Input:

[[ 1 -1  1 -1 -1 -1 -1 -1]
 [-1 -1  1  1 -1  1  1 -1]
 [ 1 -1 -1 -1 -1  1  1  1]
 [-1 -1 -1  1 -1  1  1  1]
 [ 1 -1 -1  1  1  1  1 -1]
 [-1  1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [-1 -1  1 -1  1  1  1  1]]

Permanent:

192

Input:

[[1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1],
 [1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, -1],
 [-1, -1, 1, 1, 1, -1, -1, -1, -1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, -1],
 [-1, -1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1, -1],
 [-1, 1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1],
 [1, -1, 1, 1, -1, -1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, 1, -1, -1, -1],
 [1, -1, -1, 1, -1, -1, -1, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1],
 [1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1],
 [1, -1, -1, -1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1],
 [-1, -1, 1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1],
 [-1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1],
 [1, 1, -1, -1, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1],
 [-1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1],
 [1, 1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, 1, -1, 1, 1, -1, 1, -1, 1],
 [1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1],
 [1, -1, -1, 1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, -1, -1],
 [-1, 1, 1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, 1],
 [1, 1, -1, -1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1],
 [1, 1, 1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, 1, -1, -1, -1, -1, 1],
 [-1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, 1, -1, 1, -1, -1]]

Permanent:

1021509632

The task

You should write code that, given an n by n matrix, outputs its permanent.

As I will need to test your code, it would be helpful if you could give me a simple way to pass a matrix as input to your code, for example by reading from standard in.

Be warned that the permanent can be large (the all 1s matrix is the extreme case).
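To put numbers on the warning above: for the all 1s matrix every one of the n! permutations contributes 1, so the permanent is exactly n!. The Python 3 sketch below (the helper name `first_overflow` is hypothetical) finds the smallest n at which n! no longer fits in a signed integer of a given width.

```python
from math import factorial


def first_overflow(bits):
    """Smallest n whose n! exceeds the largest signed `bits`-bit integer."""
    limit = 2 ** (bits - 1) - 1
    n = 1
    while factorial(n) <= limit:
        n += 1
    return n


for bits in (64, 128):
    print(bits, first_overflow(bits))
# prints:
# 64 21
# 128 34
```

So a 64-bit result type is already too small for the all 1s matrix at n = 21, and a signed 128-bit type at n = 34.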

Scores and ties

I will test your code on random ±1 matrices of increasing size and stop the first time your code takes more than 1 minute on my computer. The scoring matrices will be the same for all submissions in order to ensure fairness.
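For local timing it can help to produce test inputs of the same kind. The Python 3 sketch below is only an assumption about what such a generator might look like; it is not the script used for scoring. The names `random_pm1_matrix` and `format_matrix` are hypothetical, and the fixed seed is there so that every submission can be run on the same matrix.

```python
import random


def random_pm1_matrix(n, seed=None):
    """Return an n-by-n list of lists with entries drawn uniformly from {-1, 1}."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]


def format_matrix(matrix):
    """Render one row per line with entries separated by single spaces."""
    return "\n".join(" ".join(str(x) for x in row) for row in matrix)


print(format_matrix(random_pm1_matrix(4, seed=0)))
```

The whitespace-separated output can be piped straight into any entry that reads the matrix from standard input.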

If two people get the same score, the winner is the one whose code is fastest for that value of n. If those are within 1 second of each other, the winner is the one posted first.

Languages and libraries

You can use any available language and libraries you like, but no pre-existing function to compute the permanent. Where feasible, it would be good to be able to run your code, so please include a full explanation of how to run/compile your code in Linux if at all possible.

Reference implementations

There is already a codegolf question with lots of code in different languages for computing the permanent for small matrices. Mathematica and Maple also both have permanent implementations, if you can access those.

My Machine

The timings will be run on my 64-bit machine. This is a standard Ubuntu install with 8GB RAM, AMD FX-8350 Eight-Core Processor and Radeon HD 4250. This also means I need to be able to run your code.

Low level information about my machine

cat /proc/cpuinfo | grep flags gives

flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm [...] f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

I will ask a closely related multi-language follow-up question that doesn't suffer from the big Int problem, so lovers of Scala, Nim, Julia, Rust and Bash can show off their languages too.

Leader board

n = 33 (45 seconds. 64 seconds for n = 34). Ton Hospel in C++ with g++ 5.4.0.
n = 32 (32 seconds). Dennis in C with gcc 5.4 using Ton Hospel's gcc flags.
n = 31 (54 seconds). Christian Sievers in Haskell
n = 31 (60 seconds). primo in rpython
n = 30 (26 seconds). ezrast in Rust
n = 28 (49 seconds). xnor with Python + pypy 5.4.1
n = 22 (25 seconds). Shebang with Python + pypy 5.4.1

Note. The timings for Dennis's and Ton Hospel's entries vary a lot for mysterious reasons. For example, they seem to be faster after I have loaded a web browser! The timings quoted are the fastest from all the tests I have done.

math fastest-code matrix

— sergiol

5

I read the first sentence, thought 'Lembik', scrolled down, yep - Lembik.

— orlp

@orlp :) It's been a long time.

1

@Lembik I added a large test case. I'll wait for someone to confirm it.

— xnor

2

One of the answers prints an approximate result, since it uses double precision floats to store the permanent. Is that allowed?

— Dennis

1

@ChristianSievers I thought I might be able to do some magic with signs, but it didn't pan out ...

— Socratic Phoenix

14

gcc C++ n ≈ 36 (57 seconds on my system)

Uses the Glynn formula with a Gray code for updates if all column sums are even, otherwise uses Ryser's method. Threaded and vectorized. Optimized for AVX, so don't expect much on older processors. Don't bother with n>=35 for a matrix with only +1's, even if your system is fast enough, since the signed 128-bit accumulator will overflow. For random matrices you will probably not hit the overflow. For n>=37 the internal multipliers will start to overflow for an all 1/-1 matrix. So only use this program for n<=36.
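The choice between the two formulas can be illustrated with a few lines of Python 3. This is a sketch of the rule as I read it from the description above and from the line `glynn = (odd & 1) ^ 1` in the posted `my_main`, not a transcription of the C++ code; the function name `choose_method` is hypothetical.

```python
def choose_method(matrix):
    """Pick "Glynn" when every column sum is even, otherwise "Ryser"."""
    column_sums = [sum(column) for column in zip(*matrix)]
    return "Glynn" if all(s % 2 == 0 for s in column_sums) else "Ryser"


print(choose_method([[1, -1], [-1, -1]]))                      # prints Glynn
print(choose_method([[1, -1, 1], [1, 1, 1], [-1, -1, 1]]))     # prints Ryser
```

For a ±1 matrix every column sum has the same parity as n, so under this rule even n selects Glynn and odd n selects Ryser.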

Just give the matrix elements on STDIN, separated by any kind of whitespace

permanent
1 2
3 4
^D

permanent.cpp:

/*
  Compile using something like:
    g++ -Wall -O3 -march=native -fstrict-aliasing -std=c++11 -pthread -s permanent.cpp -o permanent
*/

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <cstdint>
#include <climits>
#include <array>
#include <vector>
#include <thread>
#include <future>
#include <chrono>
#include <stdexcept>
#include <string>
#include <ctgmath>
#include <immintrin.h>

using namespace std;

bool const DEBUG = false;
int const CACHE = 64;

using Index  = int_fast32_t;
Index glynn;
// Number of elements in our vectors
Index const POW   = 3;
Index const ELEMS = 1 << POW;
// Over how many floats we distribute each row
Index const WIDTH = 9;
// Number of bits in the fraction part of a floating point number
int const FLOAT_MANTISSA = 23;
// Type to use for the first add/multiply phase
using Sum  = float;
using SumN = __restrict__ Sum __attribute__((vector_size(ELEMS*sizeof(Sum))));
// Type to convert to between the first and second phase
using ProdN = __restrict__ int32_t __attribute__((vector_size(ELEMS*sizeof(int32_t))));
// Type to use for the third and last multiply phase.
// Also used for the final accumulator
using Value = __int128;
using UValue = unsigned __int128;

// Wrap Value so C++ doesn't really see it and we can put it in vectors etc.
// Needed since C++ doesn't fully support __int128
struct Number {
    Number& operator+=(Number const& right) {
        value += right.value;
        return *this;
    }
    // Output the value
    void print(ostream& os, bool dbl = false) const;
    friend ostream& operator<<(ostream& os, Number const& number) {
        number.print(os);
        return os;
    }

    Value value;
};

using ms = chrono::milliseconds;

auto nr_threads = thread::hardware_concurrency();
vector<Sum> input;

// Allocate cache aligned datastructures
template<typename T>
T* alloc(size_t n) {
    T* mem = static_cast<T*>(aligned_alloc(CACHE, sizeof(T) * n));
    if (mem == nullptr) throw(bad_alloc());
    return mem;
}

// Work assigned to thread k of nr_threads threads
Number permanent_part(Index n, Index k, SumN** more) {
    uint64_t loops = (UINT64_C(1) << n) / nr_threads;
    if (glynn) loops /= 2;
    Index l = loops < ELEMS ? loops : ELEMS;
    loops /= l;
    auto from = loops * k;
    auto to   = loops * (k+1);

    if (DEBUG) cout << "From=" << from << "\n";
    uint64_t old_gray = from ^ from/2;
    uint64_t bit = 1;
    bool bits = (to-from) & 1;

    Index nn = (n+WIDTH-1)/WIDTH;
    Index ww = nn * WIDTH;
    auto column = alloc<SumN>(ww);
    for (Index i=0; i<n; ++i)
        for (Index j=0; j<ELEMS; ++j) column[i][j] = 0;
    for (Index i=n; i<ww; ++i)
        for (Index j=0; j<ELEMS; ++j) column[i][j] = 1;
    Index b;
    if (glynn) {
        b = n > POW+1 ? n - POW - 1: 0;
        auto c = n-1-b;
        for (Index k=0; k<l; k++) {
            Index gray = k ^ k/2;
            for (Index j=0; j< c; ++j)
                if (gray & 1 << j)
                    for (Index i=0; i<n; ++i)
                        column[i][k] -= input[(b+j)*n+i];
                else
                    for (Index i=0; i<n; ++i)
                        column[i][k] += input[(b+j)*n+i];
        }
        for (Index i=0; i<n; ++i)
            for (Index k=0; k<l; k++)
                column[i][k] += input[n*(n-1)+i];

        for (Index k=1; k<l; k+=2)
            column[0][k] = -column[0][k];

        for (Index i=0; i<b; ++i, bit <<= 1) {
            if (old_gray & bit) {
                bits = bits ^ 1;
                for (Index j=0; j<ww; ++j)
                    column[j] -= more[i][j];
            } else {
                for (Index j=0; j<ww; ++j)
                    column[j] += more[i][j];
            }
        }

        for (Index i=0; i<n; ++i)
            for (Index k=0; k<l; k++)
                column[i][k] /= 2;
    } else {
        b = n > POW ? n - POW : 0;
        auto c = n-b;
        for (Index k=0; k<l; k++) {
            Index gray = k ^ k/2;
            for (Index j=0; j<c; ++j)
                if (gray & 1 << j)
                    for (Index i=0; i<n; ++i)
                        column[i][k] -= input[(b+j)*n+i];
        }

        for (Index k=1; k<l; k+=2)
            column[0][k] = -column[0][k];

        for (Index i=0; i<b; ++i, bit <<= 1) {
            if (old_gray & bit) {
                bits = bits ^ 1;
                for (Index j=0; j<ww; ++j)
                    column[j] -= more[i][j];
            }
        }
    }

    if (DEBUG) {
        for (Index i=0; i<ww; ++i) {
            cout << "Column[" << i << "]=";
            for (Index j=0; j<ELEMS; ++j) cout << " " << column[i][j];
            cout << "\n";
        }
    }

    --more;
    old_gray = (from ^ from/2) | UINT64_C(1) << b;
    Value total = 0;
    SumN accu[WIDTH];
    for (auto p=from; p<to; ++p) {
        uint64_t new_gray = p ^ p/2;
        uint64_t bit = old_gray ^ new_gray;
        Index i = __builtin_ffsl(bit);
        auto diff = more[i];
        auto c = column;
        if (new_gray > old_gray) {
            // Phase 1 add/multiply.
            // Uses floats until just before loss of precision
            for (Index i=0; i<WIDTH; ++i) accu[i] = *c++ -= *diff++;

            for (Index j=1; j < nn; ++j)
                for (Index i=0; i<WIDTH; ++i) accu[i] *= *c++ -= *diff++;
        } else {
            // Phase 1 add/multiply.
            // Uses floats until just before loss of precision
            for (Index i=0; i<WIDTH; ++i) accu[i] = *c++ += *diff++;

            for (Index j=1; j < nn; ++j)
                for (Index i=0; i<WIDTH; ++i) accu[i] *= *c++ += *diff++;
        }

        if (DEBUG) {
            cout << "p=" << p << "\n";
            for (Index i=0; i<ww; ++i) {
                cout << "Column[" << i << "]=";
                for (Index j=0; j<ELEMS; ++j) cout << " " << column[i][j];
                cout << "\n";
            }
        }

        // Convert floats to int32_t
        ProdN prod32[WIDTH] __attribute__((aligned (32)));
        for (Index i=0; i<WIDTH; ++i)
            // Unfortunately gcc doesn't recognize the static_cast<int32_t>
            // as a vector pattern, so force it with an intrinsic
#ifdef __AVX__
            //prod32[i] = static_cast<ProdN>(accu[i]);
            reinterpret_cast<__m256i&>(prod32[i]) = _mm256_cvttps_epi32(accu[i]);
#else   // __AVX__
            for (Index j=0; j<ELEMS; ++j)
                prod32[i][j] = static_cast<int32_t>(accu[i][j]);
#endif  // __AVX__

        // Phase 2 multiply. Uses int64_t until just before overflow
        int64_t prod64[3][ELEMS];
        for (Index i=0; i<3; ++i) {
            for (Index j=0; j<ELEMS; ++j)
                prod64[i][j] = static_cast<int64_t>(prod32[i][j]) * prod32[i+3][j] * prod32[i+6][j];
        }
        // Phase 3 multiply. Collect into __int128. For large matrices this will
        // actually overflow but that's ok as long as all 128 low bits are
        // correct. Terms will cancel and the final sum can fit into 128 bits
        // (This will start to fail at n=35 for the all 1 matrix)
        // Strictly speaking this needs the -fwrapv gcc option
        for (Index j=0; j<ELEMS; ++j) {
            auto value = static_cast<Value>(prod64[0][j]) * prod64[1][j] * prod64[2][j];
            if (DEBUG) cout << "value[" << j << "]=" << static_cast<double>(value) << "\n";
            total += value;
        }
        total = -total;

        old_gray = new_gray;
    }

    return bits ? Number{-total} : Number{total};
}

// Prepare datastructures, Assign work to threads
Number permanent(Index n) {
    Index nn = (n+WIDTH-1)/WIDTH;
    Index ww = nn*WIDTH;

    Index rows  = n > (POW+glynn) ? n-POW-glynn : 0;
    auto data = alloc<SumN>(ww*(rows+1));
    auto pointers = alloc<SumN *>(rows+1);
    auto more = &pointers[0];
    for (Index i=0; i<rows; ++i)
        more[i] = &data[ww*i];
    more[rows] = &data[ww*rows];
    for (Index j=0; j<ww; ++j)
        for (Index i=0; i<ELEMS; ++i)
            more[rows][j][i] = 0;

    Index loops = n >= POW+glynn ? ELEMS : 1 << (n-glynn);
    auto a = &input[0];
    for (Index r=0; r<rows; ++r) {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<loops; ++i)
                more[r][j][i] = j == 0 && i %2 ? -*a : *a;
            for (Index i=loops; i<ELEMS; ++i)
                more[r][j][i] = 0;
            ++a;
        }
        for (Index j=n; j<ww; ++j)
            for (Index i=0; i<ELEMS; ++i)
                more[r][j][i] = 0;
    }

    if (DEBUG)
        for (Index r=0; r<=rows; ++r)
            for (Index j=0; j<ww; ++j) {
                cout << "more[" << r << "][" << j << "]=";
                for (Index i=0; i<ELEMS; ++i)
                    cout << " " << more[r][j][i];
                cout << "\n";
            }

    // Send work to threads...
    vector<future<Number>> results;
    for (auto i=1U; i < nr_threads; ++i)
        results.emplace_back(async(DEBUG ? launch::deferred: launch::async, permanent_part, n, i, more));
    // And collect results
    auto r = permanent_part(n, 0, more);
    for (auto& result: results)
        r += result.get();

    free(data);
    free(pointers);

    // For glynn we should double the result, but we will only do this during
    // the final print. This allows n=34 for an all 1 matrix to work
    // if (glynn) r *= 2;
    return r;
}

// Print 128 bit number
void Number::print(ostream& os, bool dbl) const {
    const UValue BILLION = 1000000000;

    UValue val;
    if (value < 0) {
        os << "-";
        val = -value;
    } else
        val = value;
    if (dbl) val *= 2;

    uint32_t output[5];
    for (int i=0; i<5; ++i) {
        output[i] = val % BILLION;
        val /= BILLION;
    }
    bool print = false;
    for (int i=4; i>=0; --i) {
        if (print) {
            os << setfill('0') << setw(9) << output[i];
        } else if (output[i] || i == 0) {
            print = true;
            os << output[i];
        }
    }
}

// Read matrix, check for sanity
void my_main() {
    Sum a;
    while (cin >> a)
        input.push_back(a);

    size_t n = sqrt(input.size());
    if (input.size() != n*n)
        throw(logic_error("Read " + to_string(input.size()) +
                          " elements which does not make a square matrix"));

    vector<double> columns_pos(n, 0);
    vector<double> columns_neg(n, 0);
    Sum *p = &input[0];
    for (size_t i=0; i<n; ++i)
        for (size_t j=0; j<n; ++j, ++p) {
            if (*p >= 0) columns_pos[j] += *p;
            else         columns_neg[j] -= *p;
        }
    std::array<double,WIDTH> prod;
    prod.fill(1);

    int32_t odd = 0;
    for (size_t j=0; j<n; ++j) {
        prod[j%WIDTH] *= max(columns_pos[j], columns_neg[j]);
        auto sum = static_cast<int32_t>(columns_pos[j] - columns_neg[j]);
        odd |= sum;
    }
    glynn = (odd & 1) ^ 1;
    for (Index i=0; i<WIDTH; ++i)
        // A float has an implicit 1. in front of the fraction so it can
        // represent 1 bit more than the mantissa size. And 1 << (mantissa+1)
        // itself is in fact representable
        if (prod[i] && log2(prod[i]) > FLOAT_MANTISSA+1)
            throw(range_error("Values in matrix are too large. A subproduct reaches " + to_string(prod[i]) + " which doesn't fit in a float without loss of precision"));

    for (Index i=0; i<3; ++i) {
        auto prod3 = prod[i] * prod[i+3] * prod[i+6];
        if (log2(prod3) >= CHAR_BIT*sizeof(int64_t)-1)
            throw(range_error("Values in matrix are too large. A subproduct reaches " + to_string(prod3) + " which doesn't fit in an int64"));
    }

    nr_threads = pow(2, ceil(log2(static_cast<float>(nr_threads))));
    uint64_t loops = UINT64_C(1) << n;
    if (glynn) loops /= 2;
    if (nr_threads * ELEMS > loops)
        nr_threads = max(loops / ELEMS, UINT64_C(1));
    // if (DEBUG) nr_threads = 1;

    cout << n << " x " << n << " matrix, method " << (glynn ? "Glynn" : "Ryser") << ", " << nr_threads << " threads" << endl;

    // Go for the actual calculation
    auto start = chrono::steady_clock::now();
    auto perm = permanent(n);
    auto end = chrono::steady_clock::now();
    auto elapsed = chrono::duration_cast<ms>(end-start).count();

    cout << "Permanent=";
    perm.print(cout, glynn);
    cout << " (" << elapsed / 1000. << " s)" << endl;
}

// Wrapper to print any exceptions
int main() {
    try {
        my_main();
    } catch(exception& e) {
        cerr << "Error: " << e.what() << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

— Ton Hospel

flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm [...] f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

I am still wrestling with my test harness to run your code, but it looks very fast, thank you! I was wondering whether the larger int size might be causing a speed problem (as you suggested). I saw accu.org/index.php/articles/1849 in any case.

I had to modify your code to remove the quick_exit calls, as those made it very hard to use in a test harness. Out of interest, why are you using Ryser's formula when the wiki claims the other one should be twice as fast?

@Lembik I switched to Ryser's formula because with the other one I need to scale back by 2 << (n-1) at the end, which means my int128 accumulator overflowed well before that point.

— Ton Hospel

1

@Lembik Yes :-)

— Ton Hospel

7

C99, n ≈ 33 (35 seconds)

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE 12
#define NUM_THREADS 8

#define popcnt __builtin_popcountll
#define BILLION (1000 * 1000 * 1000)
#define UPDATE_ROW_PPROD() \
    update_row_pprod(row_pprod, row, rows, row_sums, mask, mask_popcnt)

typedef __int128 int128_t;

static inline int64_t update_row_pprod
(
    int64_t* row_pprod, int64_t row, int64_t* rows,
    int64_t* row_sums, int64_t mask, int64_t mask_popcnt
)
{
    int64_t temp = 2 * popcnt(rows[row] & mask) - mask_popcnt;

    row_pprod[0] *= temp;
    temp -= 1;
    row_pprod[1] *= temp;
    temp -= row_sums[row];
    row_pprod[2] *= temp;
    temp += 1;
    row_pprod[3] *= temp;

    return row + 1;
}

int main(int argc, char* argv[])
{
    int64_t size = argc - 1, rows[argc - 1];
    int64_t row_sums[argc - 1];
    int128_t permanent = 0, sign = size & 1 ? -1 : 1;

    if (argc == 2)
    {
        printf("%d\n", argv[1][0] == '-' ? -1 : 1);
        return 0;
    }

    for (int64_t row = 0; row < size; row++)
    {
        char positive = argv[row + 1][0] == '+' ? '-' : '+';

        sign *= ',' - positive;
        rows[row] = row_sums[row] = 0;

        for (char* p = &argv[row + 1][1]; *p; p++)
        {
            rows[row] <<= 1;
            rows[row] |= *p == positive;
            row_sums[row] += *p == positive;
        }

        row_sums[row] = 2 * row_sums[row] - size;
    }

    #pragma omp parallel for reduction(+:permanent) num_threads(NUM_THREADS)
    for (int64_t mask = 1; mask < 1LL << (size - 1); mask += 2)
    {
        int64_t mask_popcnt = popcnt(mask);
        int64_t row = 0;
        int128_t row_prod = 1 - 2 * (mask_popcnt & 1);
        int128_t row_prod_high = -row_prod;
        int128_t row_prod_inv = row_prod;
        int128_t row_prod_inv_high = -row_prod;

        for (int64_t chunk = 0; chunk < size / CHUNK_SIZE; chunk++)
        {
            int64_t row_pprod[4] = {1, 1, 1, 1};

            for (int64_t i = 0; i < CHUNK_SIZE; i++)
                row = UPDATE_ROW_PPROD();

            row_prod *= row_pprod[0], row_prod_high *= row_pprod[1];
            row_prod_inv *= row_pprod[3], row_prod_inv_high *= row_pprod[2];
        }

        int64_t row_pprod[4] = {1, 1, 1, 1};

        while (row < size)
            row = UPDATE_ROW_PPROD();

        row_prod *= row_pprod[0], row_prod_high *= row_pprod[1];
        row_prod_inv *= row_pprod[3], row_prod_inv_high *= row_pprod[2];
        permanent += row_prod + row_prod_high + row_prod_inv + row_prod_inv_high;
    }

    permanent *= sign;

    if (permanent < 0)
        printf("-"), permanent *= -1;

    int32_t output[5], print = 0;

    output[0] = permanent % BILLION, permanent /= BILLION;
    output[1] = permanent % BILLION, permanent /= BILLION;
    output[2] = permanent % BILLION, permanent /= BILLION;
    output[3] = permanent % BILLION, permanent /= BILLION;
    output[4] = permanent % BILLION;

    if (output[4])
        printf("%u", output[4]), print = 1;
    if (print)
        printf("%09u", output[3]);
    else if (output[3])
        printf("%u", output[3]), print = 1;
    if (print)
        printf("%09u", output[2]);
    else if (output[2])
        printf("%u", output[2]), print = 1;
    if (print)
        printf("%09u", output[1]);
    else if (output[1])
        printf("%u", output[1]), print = 1;
    if (print)
        printf("%09u\n", output[0]);
    else
        printf("%u\n", output[0]);
}

Input is currently a bit cumbersome; it is taken with rows as command-line arguments, where each entry is represented by its sign, i.e., + indicates a 1 and - indicates a -1.
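Converting a matrix into that argument format by hand is error-prone, so a small Python 3 helper may be convenient. This is only a sketch; the name `to_sign_args` is hypothetical and not part of the C program.

```python
def to_sign_args(matrix):
    """Encode each row of a +-1 matrix as a string of '+' and '-' characters."""
    return ["".join("+" if x == 1 else "-" for x in row) for row in matrix]


# First 4-by-4 example from the question.
example = [
    [1, -1, -1, 1],
    [-1, -1, -1, 1],
    [-1, 1, -1, 1],
    [1, -1, -1, 1],
]
print(" ".join(to_sign_args(example)))  # prints +--+ ---+ -+-+ +--+
```

The printed string can be pasted directly after `./permanent` on the command line.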

Test run

$ gcc -Wall -std=c99 -march=native -Ofast -fopenmp -fwrapv -o permanent permanent.c
$ ./permanent +--+ ---+ -+-+ +--+
-4
$ ./permanent ---- -+-- +--- +-+-
0
$ ./permanent +-+----- --++-++- +----+++ ---+-+++ +--++++- -+-+-++- +-+-+-+- --+-++++
192
$ ./permanent +-+--+++----++++-++- +-+++++-+--+++--+++- --+++----+-+++---+-- ---++-++++++------+- -+++-+++---+-+-+++++ +-++--+-++++-++-+--- +--+---+-++++---+++- +--+-++-+++-+-+++-++ +-----+++-----++-++- --+-+-++-+-++++++-++ -------+----++++---- ++---++--+-++-++++++ -++-----++++-----+-+ ++---+-+----+-++-+-+ +++++---+++-+-+++-++ +--+----+--++-+----- -+++-++--+++--++--++ ++--++-++-+++-++-+-+ +++---+--++---+----+ -+++-------++-++-+--
1021509632
$ time ./permanent +++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}     # 31
8222838654177922817725562880000000

real    0m8.365s
user    1m6.504s
sys     0m0.000s
$ time ./permanent ++++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}   # 32
263130836933693530167218012160000000

real    0m17.013s
user    2m15.226s
sys     0m0.001s
$ time ./permanent +++++++++++++++++++++++++++++++++{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} # 33
8683317618811886495518194401280000000

real    0m34.592s
user    4m35.354s
sys     0m0.001s

— Dennis

Do you have any ideas for improvements?

— 1

@xnor A few. I want to try packed multiplication with SSE and partially unrolling the big loop (to see if I can speed up the parallelization and compute more than 4 values at once without calling popcnt). If that saves any time, the next big hurdle is the integer type. For randomly generated matrices, the permanent is comparatively small. If I can find an easy way to compute a bound before doing the actual calculation, I could wrap the whole thing in a big conditional.

— Dennis

@Dennis About unrolling the loop, a small possible optimization is to make the top row all +1's.

— xnor

@xnor Yeah, I tried that at some point, but then reverted the change to try something else (which didn't work out at all). The bottleneck seems to be integer multiplication (which is slow for 64 bits and really slow for 128), which is why I hope SSE will help a bit.

— Dennis

1

@Dennis I see. About bounds, one non-obvious bound is in terms of the operator norm: |Per(M)| <= |M|^n. See arxiv.org/pdf/1606.07474v1.pdf

— xnor

5

Python 2, n ≈ 28

from operator import mul

def fast_glynn_perm(M):
    row_comb = [sum(c) for c in zip(*M)]
    n=len(M)

    total = 0
    old_grey = 0 
    sign = +1

    binary_power_dict = {2**i:i for i in range(n)}
    num_loops = 2**(n-1)

    for bin_index in xrange(1, num_loops + 1):  
        total += sign * reduce(mul, row_comb)

        new_grey = bin_index^(bin_index/2)
        grey_diff = old_grey ^ new_grey
        grey_diff_index = binary_power_dict[grey_diff]

        new_vector = M[grey_diff_index]
        direction = 2 * cmp(old_grey,new_grey)      

        for i in range(n):
            row_comb[i] += new_vector[i] * direction

        sign = -sign
        old_grey = new_grey

    return total/num_loops

Uses the Glynn formula with a Gray code for updates. Runs up to n=23 in a minute on my machine. One can surely do better by implementing this in a faster language and with better data structures. This doesn't use the fact that the matrix is ±1-valued.
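For readers on Python 3, where `reduce`, `cmp` and `xrange` are no longer builtins, the same idea might be written as follows. This is a sketch of a port, not the code that was timed for the leader board; the name `glynn_permanent` is hypothetical.

```python
from math import prod


def glynn_permanent(matrix):
    """Glynn's formula with Gray-code updates (Python 3 sketch)."""
    n = len(matrix)
    # With every sign delta_i = +1 the combination is just the column sums.
    row_comb = [sum(column) for column in zip(*matrix)]
    total = 0
    old_gray = 0
    sign = 1
    num_loops = 2 ** (n - 1)
    for index in range(1, num_loops + 1):
        total += sign * prod(row_comb)
        new_gray = index ^ (index >> 1)
        # Exactly one bit differs between consecutive Gray codes.
        flipped = (old_gray ^ new_gray).bit_length() - 1
        # A bit turning on flips that row's sign from +1 to -1, and back.
        direction = -2 if new_gray > old_gray else 2
        row = matrix[flipped]
        for j in range(n):
            row_comb[j] += row[j] * direction
        sign = -sign
        old_gray = new_gray
    return total // num_loops  # the division is exact


print(glynn_permanent([[1, -1, -1, 1],
                       [-1, -1, -1, 1],
                       [-1, 1, -1, 1],
                       [1, -1, -1, 1]]))  # prints -4
```

Each step touches one row only, so an iteration costs O(n) additions plus one product of n terms, and there are 2^(n-1) iterations.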

A Ryser formula implementation is very similar, summing over all 0/1 vectors of coefficients rather than ±1-vectors. It takes about twice as long as the Glynn formula because it adds over all 2^n such vectors, whereas Glynn's halves that by using symmetry to restrict to only those starting with +1.

from operator import mul

def fast_ryser_perm(M):
    n=len(M)
    row_comb = [0] * n

    total = 0
    old_grey = 0 
    sign = +1

    binary_power_dict = {2**i:i for i in range(n)}
    num_loops = 2**n

    for bin_index in range(1, num_loops) + [0]: 
        total += sign * reduce(mul, row_comb)

        new_grey = bin_index^(bin_index/2)
        grey_diff = old_grey ^ new_grey
        grey_diff_index = binary_power_dict[grey_diff]

        new_vector = M[grey_diff_index]
        direction = cmp(old_grey, new_grey)

        for i in range(n):
            row_comb[i] += new_vector[i] * direction

        sign = -sign
        old_grey = new_grey

    return total * (-1)**n
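To make the structure of Ryser's formula easier to see, here is a plain Python 3 sketch without the Gray-code trick. It recomputes every row sum for each column subset, so it is slower than the version above and is meant only as an illustration; the name `ryser_permanent` is hypothetical.

```python
from math import prod


def ryser_permanent(matrix):
    """Ryser's formula over all non-empty column subsets (unoptimized sketch)."""
    n = len(matrix)
    total = 0
    for mask in range(1, 1 << n):  # the empty subset contributes 0
        # Sum each row over the columns selected by the bits of mask.
        row_sums = [
            sum(row[j] for j in range(n) if mask >> j & 1) for row in matrix
        ]
        sign = -1 if bin(mask).count("1") % 2 else 1
        total += sign * prod(row_sums)
    return total * (-1) ** n


print(ryser_permanent([[1, 2], [3, 4]]))  # prints 10
```

The loop visits 2^n - 1 subsets, compared with 2^(n-1) sign vectors for Glynn, which is the factor of two mentioned above.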

— xnor

Awesome. Do you have pypy to test with too?

@Lembik No, I don't have much installed.

— xnor 10

I will use pypy when I test it in any case. Can you see how to implement the other fast formula? I find it confusing.

@Lembik What's the other fast formula?

— xnor

1

For reference, with pypy this easily computed n=28 in 44.6 seconds on my machine. Lembik's system seems to be fairly comparable to mine in speed, if not a bit faster.

— Kade

4

Rust + extprim

This straightforward Ryser with Gray code implementation takes about ~~65~~ 90 seconds to run n = 31 on my laptop. ~~I imagine your machine will get there in well under 60s.~~ I'm using extprim 1.1.1 for i128.

I have never used Rust and have no idea what I'm doing. No compiler options other than whatever cargo build --release does. Comments/suggestions/optimizations are appreciated.

Invocation is identical to Dennis' program.

use std::env;
use std::thread;
use std::sync::Arc;
use std::sync::mpsc;

extern crate extprim;
use extprim::i128::i128;

static THREADS : i64 = 8; // keep this a power of 2.

fn main() {
  // Read command line args for the matrix, specified like
  // "++- --- -+-" for [[1, 1, -1], [-1, -1, -1], [-1, 1, -1]].
  let mut args = env::args();
  args.next();

  let mat : Arc<Vec<Vec<i64>>> = Arc::new(args.map( |ss|
    ss.trim().bytes().map( |cc| if cc == b'+' {1} else {-1}).collect()
  ).collect());

  // Figure how many iterations each thread has to do.
  let size = 2i64.pow(mat.len() as u32);
  let slice_size = size / THREADS; // Assumes divisibility.

  let mut accumulator : i128;
  if slice_size >= 4 { // permanent() requires 4 divides slice_size
    let (tx, rx) = mpsc::channel();

    // Launch threads.
    for ii in 0..THREADS {
      let mat = mat.clone();
      let tx = tx.clone();
      thread::spawn(move ||
        tx.send(permanent(&mat, ii * slice_size, (ii+1) * slice_size))
      );
    }

    // Accumulate results.
    accumulator = extprim::i128::ZERO;
    for _ in 0..THREADS {
      accumulator += rx.recv().unwrap();
    }
  }
  else { // Small matrix, don't bother threading.
    accumulator = permanent(&mat, 0, size);
  }
  println!("{}", accumulator);
}

fn permanent(mat: &Vec<Vec<i64>>, start: i64, end: i64) -> i128 {
  let size = mat.len();
  let sentinel = std::i64::MAX / size as i64;

  let mut bits : Vec<bool> = Vec::with_capacity(size);
  let mut sums : Vec<i64> = Vec::with_capacity(size);

  // Initialize gray code bits.
  let gray_number = start ^ (start / 2);

  for row in 0..size {
    bits.push((gray_number >> row) % 2 == 1);
    sums.push(0);
  }

  // Initialize column sums
  for row in 0..size {
    if bits[row] {
      for column in 0..size {
        sums[column] += mat[row][column];
      }
    }
  }

  // Do first two iterations with initial sums
  let mut total = product(&sums, sentinel);
  for column in 0..size {
    sums[column] += mat[0][column];
  }
  bits[0] = true;

  total -= product(&sums, sentinel);

  // Do rest of iterations updating gray code bits incrementally
  let mut gray_bit : usize;
  let mut idx = start + 2;
  while idx < end {
    gray_bit = idx.trailing_zeros() as usize;

    if bits[gray_bit] {
      for column in 0..size {
        sums[column] -= mat[gray_bit][column];
      }
      bits[gray_bit] = false;
    }
    else {
      for column in 0..size {
        sums[column] += mat[gray_bit][column];
      }
      bits[gray_bit] = true;
    }

    total += product(&sums, sentinel);

    if bits[0] {
      for column in 0..size {
        sums[column] -= mat[0][column];
      }
      bits[0] = false;
    }
    else {
      for column in 0..size {
        sums[column] += mat[0][column];
      }
      bits[0] = true;
    }

    total -= product(&sums, sentinel);
    idx += 2;
  }
  return if size % 2 == 0 {total} else {-total};
}

#[inline]
fn product(sums : &Vec<i64>, sentinel : i64) -> i128 {
  let mut ret : Option<i128> = None;
  let mut tally = sums[0];
  for ii in 1..sums.len() {
    if tally.abs() >= sentinel {
      ret = Some(ret.map_or(i128::new(tally), |n| n * i128::new(tally)));
      tally = sums[ii];
    }
    else {
      tally *= sums[ii];
    }
  }
  if ret.is_none() {
    return i128::new(tally);
  }
  return ret.unwrap() * i128::new(tally);
}

— ezrast

Could you give a copy-and-pasteable command line for installing extprim and compiling the code?

The output looks like "i128!(-2)", where -2 is the correct answer. Is this expected, and could you change it to output just the permanent?

@Lembik: The output should be fixed now. It looks like you've figured out compilation, but I've thrown it into Git so that you can run git clone https://gitlab.com/ezrast/permanent.git; cd permanent; cargo build --release if you want to be sure to have the same setup as me. Cargo will handle the dependencies. The binary goes in target/release.

— ezrast

Unfortunately this gives the wrong answer for n = 29. bpaste.net/show/99d6e826d968

@Lembik gah, sorry, the intermediate values were much bigger than I thought. It's fixed now, though the program is a lot slower.

— ezrast

4

Haskell, n = 31 (54)

With lots of invaluable contributions by @Angs: use Vector, use short-circuit products, look at odd n.

import Control.Parallel.Strategies
import qualified Data.Vector.Unboxed as V
import Data.Int

type Row = V.Vector Int8

x :: Row -> [Row] -> Integer -> Int -> Integer
x p (v:vs) m c = let c' = c - 1
                     r = if c>0 then parTuple2 rseq rseq else r0
                     (a,b) = ( x p                  vs m    c' ,
                               x (V.zipWith(-) p v) vs (-m) c' )
                             `using` r
                 in a+b
x p _      m _ = prod m p

prod :: Integer -> Row -> Integer
prod m p = if 0 `V.elem` p then 0 
                           else V.foldl' (\a b->a*fromIntegral b) m p

p, pt :: [Row] -> Integer
p (v:vs) = x (foldl (V.zipWith (+)) v vs) (map (V.map (2*)) vs) 1 11
           `div` 2^(length vs)
p [] = 1 -- handle 0x0 matrices too  :-)

pt (v:vs) | even (length vs) = p ((V.map (2*) v) : vs ) `div` 2
pt mat                       = p mat

main = getContents >>= print . pt . map V.fromList . read

My first attempt at parallelism in Haskell. You can see a lot of optimization steps through the revision history. Amazingly, they were mostly very small changes. The code is based on the formula in the section "Balasubramanian-Bax/Franklin-Glynn formula" of the Wikipedia article on computing the permanent.

p computes the permanent. It is called via pt, which transforms the matrix in a way that is always valid, but especially useful for the matrices that we get here.
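
The roles of p and pt can be illustrated with a minimal Python 3 sketch. It is not a translation of the Haskell code: the names permanent_glynn and permanent_doubled_first_row are made up for illustration, and nothing here is parallel or optimized. It assumes the Glynn form of the formula, in which the first sign is fixed at +1 and the total is divided by 2^(n-1).

```python
from itertools import product
from math import prod


def permanent_glynn(a):
    """Permanent via the Glynn formula, with the first sign fixed to +1."""
    n = len(a)
    if n == 0:
        return 1
    total = 0
    for rest in product((1, -1), repeat=n - 1):
        delta = (1,) + rest
        # Signed sum of the rows, taken column by column.
        col_sums = [sum(delta[i] * a[i][j] for i in range(n)) for j in range(n)]
        total += prod(delta) * prod(col_sums)
    # The total is exactly 2^(n-1) times the permanent, so this is exact.
    return total // 2 ** (n - 1)


def permanent_doubled_first_row(a):
    """Same value, computed as perm(A') / 2 where A' has its first row doubled."""
    if not a:
        return 1
    doubled = [[2 * v for v in a[0]]] + [list(row) for row in a[1:]]
    return permanent_glynn(doubled) // 2
```

Doubling one row doubles the permanent, which is why the second function is always valid. For odd n and entries of ±1, every signed column sum is odd and so can never be zero; after doubling the first row the sums become even, so a product that stops early on a zero gets a chance to fire.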

Compile with ghc -O2 -threaded -fllvm -feager-blackholing -o <name> <name>.hs. To run with parallelisation, give it runtime parameters like this: ./<name> +RTS -N. Input is read from stdin as nested comma-separated lists in brackets, like [[1,2],[3,4]], as in the last example (newlines are allowed everywhere).

— Christian Sievers

I was able to get a 20-25% speed improvement by plugging in Data.Vector. The changes, excluding changed function types: import qualified Data.Vector as V, x (V.zipWith(-) p v) vs (-m) c' ), p (v:vs) = x (foldl (V.zipWith (+)) v vs) (map (V.map (2*)) vs) 1 11, main = getContents >>= print . p . map V.fromList . read

— Angs

@Angs Thanks a lot! I didn't really feel like looking into better-suited datatypes. It's amazing how much small things change (I also had to use V.product). That only gave me ~10%. I changed the code so that the vectors contain only Ints. That's okay because they are only added; the big numbers come from the multiplication. Then it was ~20%. I had tried the same change with the old code, but at that time it slowed things down. I tried again because it allows the use of unboxed vectors, which helped a lot!

— Christian Sievers

@christian-sievers Glad I could be of help. Here's another fun luck-based optimization I found: x p _ m _ = m * (sum $ V.foldM' (\a b -> if b==0 then Nothing else Just $ a*fromIntegral b) 1 p) - the product as a monadic fold where 0 is a special case. It seems to be beneficial more often than not.

— Angs

@Angs Great! I changed it into a form that doesn't need Transversable (I guess it was no mistake that you didn't change product...), for the sake of ghc from e.g. Debian stable. It exploits the form of the input, but that seems fine: we're not relying on it, only optimizing for it. It makes timing much more exciting: my random 30x30 matrix is slightly faster than 29x29, but then 31x31 takes 4x the time. - That INLINE doesn't seem to work for me. AFAIK it is ignored for recursive functions.

— Christian Sievers

@christian-sievers Yes, I was going to say something about product but forgot. It seems that only even lengths have zeros in p, so for odd lengths we should use the regular product instead of the short-circuiting one to get the best of both worlds.

— Angs

3

Mathematica, n ≈ 20

p[m_] := Last[Fold[Take[ListConvolve[##, {1, -1}, 0], 2^Length[m]]&,
  Table[If[IntegerQ[Log2[k]], m[[j, Log2[k] + 1]], 0], {j, Length[m]}, {k, 0, 2^Length[m] - 1}]]]

Using the Timing command, a 20x20 matrix requires about 48 seconds on my system. This is not quite as efficient as the other answers, since it relies on the fact that the permanent can be found as a coefficient of the product of polynomials built from each row of the matrix. Efficient polynomial multiplication is performed by creating the coefficient lists and performing convolution using ListConvolve. This requires about O(2ⁿ n²) time, assuming convolution is performed using a fast Fourier transform or similar, which requires O(n log n) time.
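
The polynomial idea can be sketched in Python 3. This is a minimal illustration rather than a translation of the Mathematica code: the name permanent_poly is made up, and plain nested loops stand in for ListConvolve. Row j becomes the polynomial whose coefficient of x^(2^i) is m[j][i]. In the product of all n row polynomials, the exponent 2^n - 1 can only be reached by picking every column exactly once, so its coefficient is the permanent.

```python
def permanent_poly(m):
    """Permanent as the coefficient of x^(2^n - 1) in a product of row polynomials."""
    n = len(m)
    size = 1 << n
    # Coefficients of the running product, truncated to degrees below 2^n.
    acc = [0] * size
    acc[0] = 1
    for row in m:
        nxt = [0] * size
        for deg, coeff in enumerate(acc):
            if coeff == 0:
                continue
            for col, value in enumerate(row):
                d = deg + (1 << col)
                if d < size:  # higher degrees can never come back down
                    nxt[d] += coeff * value
        acc = nxt
    return acc[size - 1]
```

Truncating at degree 2^n is safe because multiplying only ever raises the degree, which is what the Take[..., 2^Length[m]] step does in the original.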

— miles

3

Python 2, n = 22 [Reference]

This is the 'reference' implementation that I shared with Lembik yesterday. It misses making it to n=23 by a few seconds on his machine; on my machine it does it in about 52 seconds. To achieve these speeds you need to run it through PyPy.

The first function calculates the permanent similarly to how the determinant could be calculated, by going over each submatrix until you are left with a 2x2, to which you can apply the basic rule. It is incredibly slow.

The second function is the one implementing the Ryser function (the second equation listed in Wikipedia). The set S is essentially the powerset of the numbers {1,...,n} (the variable s_list in the code).

from random import *
from time import time
from itertools import*

def perm(a): # naive method, recurses over submatrices, slow 
    if len(a) == 1:
        return a[0][0]
    elif len(a) == 2:
        return a[0][0]*a[1][1]+a[1][0]*a[0][1]
    else:
        tsum = 0
        for i in xrange(len(a)):
            transposed = [zip(*a)[j] for j in xrange(len(a)) if j != i]
            tsum += a[0][i] * perm(zip(*transposed)[1:])
        return tsum

def perm_ryser(a): # Ryser's formula, using matrix entries
    maxn = len(a)
    n_list = range(1,maxn+1)
    s_list = chain.from_iterable(combinations(n_list,i) for i in range(maxn+1))
    total = 0
    for st in s_list:
        stotal = (-1)**len(st)
        for i in xrange(maxn):
            stotal *= sum(a[i][j-1] for j in st)
        total += stotal
    return total*((-1)**maxn)


def genmatrix(d):
    mat = []
    for x in xrange(d):
        row = []
        for y in xrange(d):
            row.append([-1,1][randrange(0,2)])
        mat.append(row)
    return mat

def main():
    for i in xrange(1,24):
        k = genmatrix(i)
        print 'Matrix: (%dx%d)'%(i,i)
        print '\n'.join('['+', '.join(`j`.rjust(2) for j in a)+']' for a in k)
        print 'Permanent:',
        t = time()
        p = perm_ryser(k)
        print p,'(took',time()-t,'seconds)'

if __name__ == '__main__':
    main()
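
perm_ryser above recomputes every row sum from scratch for each subset. As a hedged sketch (Python 3, not part of the reference implementation; the name permanent_ryser_gray is made up), the same sum can be walked in Gray-code order, so that consecutive subsets differ by one column and the row sums are updated incrementally:

```python
from math import prod


def permanent_ryser_gray(a):
    """Ryser's formula, visiting the column subsets in Gray-code order."""
    n = len(a)
    if n == 0:
        return 1
    row_sums = [0] * n      # row_sums[i] = sum of a[i][j] over columns j in the subset
    in_set = [False] * n
    total = 0               # the empty subset contributes 0 when n >= 1
    for k in range(1, 1 << n):
        j = (k & -k).bit_length() - 1   # column toggled at step k: lowest set bit of k
        step = -1 if in_set[j] else 1
        in_set[j] = not in_set[j]
        for i in range(n):
            row_sums[i] += step * a[i][j]
        # One column is toggled per step, so the subset size has the parity of k.
        sign = -1 if k & 1 else 1
        total += sign * prod(row_sums)
    return total if n % 2 == 0 else -total
```

Each step costs O(n) for the update plus O(n) for the product, instead of the O(n²) needed to rebuild all row sums.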

— Kade

I think you should rephrase the description "similar to how the determinant would be calculated". It's not that the method for determinants is slow for permanents, but that a slow method for determinants works similarly (and as slowly) for permanents.

— Christian Sievers

@ChristianSievers Good point, I've changed it.

— Kade

2

RPython 5.4.1, n ≈ 32 (37 seconds)

from rpython.rlib.rtime import time
from rpython.rlib.rarithmetic import r_int, r_uint
from rpython.rlib.rrandom import Random
from rpython.rlib.rposix import pipe, close, read, write, fork, waitpid
from rpython.rlib.rbigint import rbigint

from math import log, ceil
from struct import pack

bitsize = len(pack('l', 1)) * 8 - 1

bitcounts = bytearray([0])
for i in range(16):
  b = bytearray([j+1 for j in bitcounts])
  bitcounts += b


def bitcount(n):
  bits = 0
  while n:
    bits += bitcounts[n & 65535]
    n >>= 16
  return bits


def main(argv):
  if len(argv) < 2:
    write(2, 'Usage: %s NUM_THREADS [N]'%argv[0])
    return 1
  threads = int(argv[1])

  if len(argv) > 2:
    n = int(argv[2])
    rnd = Random(r_uint(time()*1000))
    m = []
    for i in range(n):
      row = []
      for j in range(n):
        row.append(1 - r_int(rnd.genrand32() & 2))
      m.append(row)
  else:
    m = []
    strm = ""
    while True:
      buf = read(0, 4096)
      if len(buf) == 0:
        break
      strm += buf
    rows = strm.split("\n")
    for row in rows:
      r = []
      for val in row.split(' '):
        r.append(int(val))
      m.append(r)
    n = len(m)

  a = []
  for row in m:
    val = 0
    for v in row:
      val = (val << 1) | -(v >> 1)
    a.append(val)

  batches = int(ceil(n * log(n) / (bitsize * log(2))))

  pids = []
  handles = []
  total = rbigint.fromint(0)
  for i in range(threads):
    r, w = pipe()
    pid = fork()
    if pid:
      close(w)
      pids.append(pid)
      handles.append(r)
    else:
      close(r)
      total = run(n, a, i, threads, batches)
      write(w, total.str())
      close(w)
      return 0

  for pid in pids:
    waitpid(pid, 0)

  for handle in handles:
    strval = read(handle, 256)
    total = total.add(rbigint.fromdecimalstr(strval))
    close(handle)

  print total.rshift(n-1).str()

  return 0


def run(n, a, mynum, threads, batches):
  start = (1 << n-1) * mynum / threads
  end = (1 << n-1) * (mynum+1) / threads

  dtotal = rbigint.fromint(0)
  for delta in range(start, end):
    pdelta = rbigint.fromint(1 - ((bitcount(delta) & 1) << 1))
    for i in range(batches):
      pbatch = 1
      for j in range(i, n, batches):
        pbatch *= n - (bitcount(delta ^ a[j]) << 1)
      pdelta = pdelta.int_mul(pbatch)
    dtotal = dtotal.add(pdelta)

  return dtotal


def target(*args):
  return main

To compile, download the most recent PyPy source, and execute the following:

pypy /path/to/pypy-src/rpython/bin/rpython matrix-permanent.py

The resulting executable will be named matrix-permanent-c or similar in the current working directory.

As of PyPy 5.0, RPython's threading primitives are a lot less primitive than they used to be. Newly spawned threads require the GIL, which is more or less useless for parallel computations. I've used fork instead, so it may not work as expected on Windows, ~~although I haven't tested~~ fails to compile (unresolved external symbol _fork).

The executable accepts up to two command line parameters. The first is the number of threads; the second, optional parameter is n. If it is provided, a random matrix will be generated, otherwise the matrix will be read from stdin. Each row must be newline separated (without a trailing newline), and each value space separated. The 20x20 example input would be given as:

1 -1 1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 1 -1
1 -1 1 1 1 1 1 -1 1 -1 -1 1 1 1 -1 -1 1 1 1 -1
-1 -1 1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 1 -1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 -1
-1 1 1 1 -1 1 1 1 -1 -1 -1 1 -1 1 -1 1 1 1 1 1
1 -1 1 1 -1 -1 1 -1 1 1 1 1 -1 1 1 -1 1 -1 -1 -1
1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 -1 -1 1 1 1 -1
1 -1 -1 1 -1 1 1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1
1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1
-1 -1 1 -1 1 -1 1 1 -1 1 -1 1 1 1 1 1 1 -1 1 1
-1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1
1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 1
-1 1 1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1
1 1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 1 -1 1 1 -1 1 -1 1
1 1 1 1 1 -1 -1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1
1 -1 -1 1 -1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1
-1 1 1 1 -1 1 1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1
1 1 -1 -1 1 1 -1 1 1 -1 1 1 1 -1 1 1 -1 1 -1 1
1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 -1 1
-1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1 1 -1 -1

Sample Usage

$ time ./matrix-permanent-c 8 30
8395059644858368

real    0m8.582s
user    1m8.656s
sys     0m0.000s

Method

I've used the Balasubramanian-Bax/Franklin-Glynn formula, with a runtime complexity of O(2ⁿ n). However, instead of iterating the δ in Gray code order, I've replaced the vector-row multiplication with a single xor operation (mapping (1, -1) → (0, 1)). The vector sum can likewise be found in a single operation, by taking n minus twice the popcount.
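
The xor/popcount trick can be sketched in plain Python 3. This is an illustration only, not the RPython program: the name permanent_popcount is made up, bin(x).count("1") stands in for the table-based bitcount, and there is no forking or overflow batching because Python integers are unbounded. It only works for matrices whose entries are all 1 or -1.

```python
def permanent_popcount(m):
    """Glynn formula for a +1/-1 matrix, with rows and signs packed into integers."""
    n = len(m)
    if n == 0:
        return 1
    # Pack each row into an integer: entry 1 -> bit 0, entry -1 -> bit 1.
    rows = []
    for row in m:
        bits = 0
        for v in row:
            bits = (bits << 1) | (1 if v == -1 else 0)
        rows.append(bits)
    total = 0
    # delta uses the same encoding; its top bit stays 0, fixing that sign at +1.
    for delta in range(1 << (n - 1)):
        term = -1 if bin(delta).count("1") & 1 else 1   # product of the signs
        for r in rows:
            # Positions where the bits differ contribute -1, the others +1.
            term *= n - 2 * bin(delta ^ r).count("1")
        total += term
    # The total is exactly 2^(n-1) times the permanent, so the shift is exact.
    return total >> (n - 1)
```

For example, with the row (1, -1) encoded as 01 and delta = 01, the xor is 00, so the signed sum is 2 - 2·0 = 2.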

— Primo

Unfortunately the code for bpaste.net/show/8690251167e7 ...

@Lembik Updated. Out of curiosity, could you tell me the result of the following code? bpaste.net/show/76ec65e1b533

— Primo

It gives "True 18446744073709551615". I've now added the results for your very nice code as well.

@Lembik Thanks. I had already split the multiplication so as not to overflow 63 bits. Was the result listed with 8 threads? Does 2 or 4 make a difference? If 30 finishes in 25, it seems like 31 should be under a minute.

— Primo

-1

Racket 84 bytes

The following simple function works for smaller matrices but hangs on my machine for larger ones:

(for/sum((p(permutations(range(length l)))))(for/product((k l)(c p))(list-ref k c)))

Ungolfed:

(define (f ll) 
  (for/sum ((p (permutations (range (length ll))))) 
    (for/product ((l ll)(c p)) 
      (list-ref l c))))

The code can easily be modified for unequal numbers of rows and columns.
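
For readers who don't read Racket, the same sum over all permutations might look like this in Python 3 (a sketch for square matrices only; the name permanent_bruteforce is made up for illustration):

```python
from itertools import permutations
from math import prod


def permanent_bruteforce(m):
    """Permanent straight from the definition: one product per permutation."""
    n = len(m)
    return sum(
        prod(m[i][p[i]] for i in range(n))   # row i is paired with column p[i]
        for p in permutations(range(n))
    )
```

The loop runs n! times, which is 2432902008176640000 iterations for a 20x20 matrix, so any approach of this kind stalls on an input of that size.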

Testing:

(f '[[ 1 -1 -1  1]
     [-1 -1 -1  1]
     [-1  1 -1  1]
     [ 1 -1 -1  1]])

(f '[[ 1 -1  1 -1 -1 -1 -1 -1]
 [-1 -1  1  1 -1  1  1 -1]
 [ 1 -1 -1 -1 -1  1  1  1]
 [-1 -1 -1  1 -1  1  1  1]
 [ 1 -1 -1  1  1  1  1 -1]
 [-1  1 -1  1 -1  1  1 -1]
 [ 1 -1  1 -1  1 -1  1 -1]
 [-1 -1  1 -1  1  1  1  1]])

Output:

-4
192

As I mentioned above, it hangs on the following test:

(f '[[1 -1 1 -1 -1 1 1 1 -1 -1 -1 -1 1 1 1 1 -1 1 1 -1]
 [1 -1 1 1 1 1 1 -1 1 -1 -1 1 1 1 -1 -1 1 1 1 -1]
 [-1 -1 1 1 1 -1 -1 -1 -1 1 -1 1 1 1 -1 -1 -1 1 -1 -1]
 [-1 -1 -1 1 1 -1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 -1]
 [-1 1 1 1 -1 1 1 1 -1 -1 -1 1 -1 1 -1 1 1 1 1 1]
 [1 -1 1 1 -1 -1 1 -1 1 1 1 1 -1 1 1 -1 1 -1 -1 -1]
 [1 -1 -1 1 -1 -1 -1 1 -1 1 1 1 1 -1 -1 -1 1 1 1 -1]
 [1 -1 -1 1 -1 1 1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1]
 [1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1]
 [-1 -1 1 -1 1 -1 1 1 -1 1 -1 1 1 1 1 1 1 -1 1 1]
 [-1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1]
 [1 1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 -1 1 1 1 1 1 1]
 [-1 1 1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1 -1 1 -1 1]
 [1 1 -1 -1 -1 1 -1 1 -1 -1 -1 -1 1 -1 1 1 -1 1 -1 1]
 [1 1 1 1 1 -1 -1 -1 1 1 1 -1 1 -1 1 1 1 -1 1 1]
 [1 -1 -1 1 -1 -1 -1 -1 1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1]
 [-1 1 1 1 -1 1 1 -1 -1 1 1 1 -1 -1 1 1 -1 -1 1 1]
 [1 1 -1 -1 1 1 -1 1 1 -1 1 1 1 -1 1 1 -1 1 -1 1]
 [1 1 1 -1 -1 -1 1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 -1 1]
 [-1 1 1 1 -1 -1 -1 -1 -1 -1 -1 1 1 -1 1 1 -1 1 -1 -1]])

— rnso

Isn't this answer better suited to a code-golf version of this question than to the speed version?
