How to do one-class text classification?

14

I have to deal with a text classification problem. A web crawler crawls webpages of a certain domain, and for each webpage I want to find out whether or not it belongs to one specific class. That is, if I call this class Positive, each crawled webpage belongs either to class Positive or to class Non-Positive.

I already have a large training set of webpages for class Positive. But how do I create a training set for class Non-Positive that is as representative as possible? I mean, I could use basically anything and everything for that class. Can I just collect some arbitrary pages that definitely do not belong to class Positive? I'm sure the performance of a text classification algorithm (I prefer to use a Naive Bayes algorithm) depends heavily on which webpages I choose for class Non-Positive.

So what should I do? Can somebody give me some advice? Thank you very much!

— pemistahl

This is actually a two-class problem, since you have two classes. In a one-class setting you would have only one class and would be interested in assessing how well your observations fit the data (i.e. detecting outliers).

— टिम

This learning problem has a name: PU learning (learning from positive and unlabeled examples). It is the natural choice when positive examples are easy or natural to obtain but the negatives are basically everything else (and hard to obtain). In principle you want to learn a standard two-class classifier, but with a different criterion: optimize the area under the PR (precision-recall) curve. This software package lets you train such a classifier (code).
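To make the criterion concrete, here is a minimal sketch of average precision, one common estimate of the area under the precision-recall curve. It only shows how a ranking could be evaluated; it is not taken from the package mentioned above, and the function name `average_precision` is chosen here purely for illustration.

```python
def average_precision(scores, labels):
    """Average precision of a ranking; labels are 1 (positive) or 0.

    Documents are ranked by descending score, and the precision at the
    rank of each positive document is averaged over all positives.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank documents from highest to lowest classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Hypothetical classifier scores for four pages, two of them positive.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(round(ap, 4))
```

A classifier that ranks every positive page above every other page gets an average precision of 1.0, so this number rewards putting the rare positives at the top rather than overall accuracy.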

— Googlep

5

The Spy EM algorithm solves exactly this problem.

S-EM is a text learning or classification system that learns from a set of positive and unlabeled examples (no negative examples). It is based on a "spy" technique, naive Bayes, and the EM algorithm.

The basic idea is to combine your positive set with a whole bunch of randomly crawled documents. You initially treat all the crawled documents as the negative class and learn a naive Bayes classifier on that set. Some of those crawled documents will actually be positive, and you can conservatively relabel any documents that score higher than the lowest-scoring true positive document. Then you repeat this process until it stabilizes.
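The loop described above can be sketched in plain Python as follows. This is a simplified illustration of the relabeling idea only, not the actual S-EM implementation (which also plants "spy" documents and runs EM). The function names, the token-list document format, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Multinomial naive Bayes counts; each doc is a list of tokens."""
    vocab = {w for d in docs for w in d}
    counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    totals = {y: sum(counts[y].values()) for y in (0, 1)}
    return vocab, counts, totals, Counter(labels)


def score(model, doc):
    """Log-odds that doc is positive, with Laplace smoothing."""
    vocab, counts, totals, n_docs = model
    s = math.log(n_docs[1] / n_docs[0])
    for w in doc:
        if w in vocab:
            p1 = (counts[1][w] + 1) / (totals[1] + len(vocab))
            p0 = (counts[0][w] + 1) / (totals[0] + len(vocab))
            s += math.log(p1 / p0)
    return s


def iterative_relabel(positives, unlabeled, max_iter=20):
    """Return 0/1 labels for the unlabeled docs after relabeling."""
    labels_u = [0] * len(unlabeled)  # start: all unlabeled are negative
    for _ in range(max_iter):
        if all(labels_u):  # no negatives left to train against
            break
        model = train_nb(positives + unlabeled,
                         [1] * len(positives) + labels_u)
        # Lowest score among the known (true) positives.
        threshold = min(score(model, d) for d in positives)
        # Conservative, one-way relabeling: 0 -> 1 only.
        new_labels = [1 if old == 1 or score(model, d) > threshold else 0
                      for old, d in zip(labels_u, unlabeled)]
        if new_labels == labels_u:  # stable, so stop
            break
        labels_u = new_labels
    return labels_u


# Toy data (hypothetical): three known positives, four crawled pages.
positives = [["python", "code", "function"],
             ["python", "class", "code"],
             ["code", "function", "class"]]
unlabeled = [["python", "code", "class", "function"],
             ["recipe", "butter", "sugar"],
             ["sugar", "flour", "recipe"],
             ["butter", "flour", "oven"]]
print(iterative_relabel(positives, unlabeled))
```

Relabeling here is one-way (a document moved to the positive class stays there), which is one reading of "conservatively"; other variants are possible.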

— rrenaud

Thanks a lot, that sounds quite promising. I'll take a look into it.

— pemistahl

6

Here is a good thesis about one-class classification:

Tax, D. M.: One-class classification - Concept-learning in the absence of counter-examples, PhD thesis, Technische Universiteit Delft, 2001. (pdf)

This thesis introduces the method of Support Vector Data Description (SVDD), a one-class support vector machine that finds a minimal hypersphere around the data rather than a hyperplane that separates the data.

The thesis also reviews other one-class classifiers.
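For intuition about the hypersphere idea, here is a deliberately crude stand-in: it places the sphere at the centroid of the positive examples and picks the radius as a quantile of their distances. This is not the SVDD optimization from the thesis (which solves for the minimal enclosing sphere, typically with kernels and slack variables); the function names and the `quantile` parameter are assumptions made for this sketch.

```python
import math


def fit_sphere(points, quantile=0.95):
    """Centroid-and-radius boundary around the training points.

    The radius is the distance below which roughly `quantile` of the
    training points fall, so a few outlying points may be left outside.
    """
    dim = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    dists = sorted(math.dist(p, center) for p in points)
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return center, dists[idx]


def is_inlier(point, center, radius):
    """True if the point falls inside or on the fitted sphere."""
    return math.dist(point, center) <= radius


# Hypothetical 2-D feature vectors of positive examples.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
center, radius = fit_sphere(train)
print(is_inlier((0.5, 0.5), center, radius))
print(is_inlier((5.0, 5.0), center, radius))
```

New documents would be accepted as belonging to the class only if their feature vector falls inside the sphere, which is what makes the approach usable without any negative examples.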

— nub

Welcome to the site, @nub. We hope to build a permanent repository of statistical information, and as such we worry about the possibility of linkrot. Would you mind giving a summary of the info in that thesis in case the link goes dead?

— gung - Reinstate Monica

Thank you for summarizing. Please register & merge your accounts (you can find out how in the My Account section of our help center), then you will be able to edit & comment on your own posts.

— gung - Reinstate Monica

@gung Thanks for the welcome. I'm thrilled to have received the "Yearling" badge on StackOverflow itself, so now I can comment everywhere.

— JosiahYoder-deactive except..

@JosiahYoder, if you are the OP here, please merge your accounts. You can find out how in the My Account section of our help center.

— gung - Reinstate Monica

I'm not the OP. Just a random SO user who happened across this question.

— JosiahYoder-deactive except..

1

Good training requires data that provides good estimates of the individual class probabilities. Every classification problem involves at least two classes. In your case the second class is anything that is not in the positive class. Forming a good decision boundary, with Bayes or any other good method, is best done with as much training data as possible, randomly selected from the class. If you select non-randomly, you might get a sample that doesn't truly represent the shape of the class-conditional densities/distributions, which could lead to a poor choice of decision boundary.
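As a small illustration of random selection, the sketch below draws a uniform sample of negatives from the crawled pages that are not in the positive set. It assumes pages are identified by URL and that every page outside the positive set really is non-positive; the function name and the example URLs are hypothetical.

```python
import random


def sample_negatives(crawled_urls, positive_urls, k, seed=0):
    """Uniform random sample of up to k pages outside the positive set."""
    positives = set(positive_urls)
    candidates = [u for u in crawled_urls if u not in positives]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical crawl of ten pages, two of which are known positives.
crawled = [f"http://example.com/page{i}" for i in range(10)]
known_positive = ["http://example.com/page0", "http://example.com/page3"]
print(sample_negatives(crawled, known_positive, k=4))
```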

— Michael R. Chernick

1

You are right, this is exactly what bothers me. How to select a sample of non-positive samples that leads to a good decision boundary? Is doing a random selection the best I can do?

— pemistahl

0

I agree with Michael.

Regarding your question about random selection: yes, you have to select randomly from the complement of your 'positives'. If there is a chance that your 'positives' are not fully 'pure positive', if I may use that phrase, then you can also try some kind of matched definition for the positives, so that you control for the variables that may be contaminating the definition of 'positive'. In that case you have to match on the same variables on the 'non-positive' side as well.

— crmportals

0

An article that may be of interest is:

"Extended nearest shrunken centroid classification: A new method for open-set authorship attribution of texts of varying sizes", Schaalje, Fields, Roper, and Snow. Literary and Linguistic Computing, vol. 26, No. 1, 2011.

It takes a method for attributing a text to one of a set of candidate authors and extends it to allow for the possibility that the true author is not in the candidate set. Even if you don't use the NSC method, the ideas in the paper may be useful in thinking about how to proceed.

— Greg Snow