What are VectorSource and VCorpus in the 'tm' (Text Mining) package in R?

9

I'm not sure exactly what VectorSource and VCorpus are in the 'tm' package.

The documentation on these is unclear. Can anyone explain them to me in simple terms?

r text-mining

12

A "corpus" is a collection of text documents.

VCorpus in tm refers to a "volatile" corpus, which means the corpus is stored in memory and is destroyed when the R object containing it is destroyed.

Contrast this with PCorpus, or permanent corpus, which is stored outside memory in a database.

To create a VCorpus using tm, we need to pass a "source" object as a parameter to the VCorpus method. You can list the available sources with this function -
getSources()

[1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"
[5] "XMLSource"       "ZipSource"

Sources abstract input locations, such as a directory or a URI. VectorSource is only for character vectors.
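To make the idea of a source a bit more concrete, here is a minimal sketch contrasting two of the sources listed above. The variable names (`vs`, `df`, `ds`) are made up for illustration, and the `doc_id`/`text` column layout assumes a reasonably recent tm release; older versions may treat data frames differently.

```r
library(tm)

# VectorSource: each element of a character vector becomes one document
vs <- VectorSource(c("first document", "second document"))

# DataframeSource: each row of a data frame becomes one document.
# Assumption: a recent tm version, which expects the first two columns
# to be named doc_id and text.
df <- data.frame(doc_id = c("a", "b"),
                 text = c("first document", "second document"),
                 stringsAsFactors = FALSE)
ds <- DataframeSource(df)

# Either source can be handed to VCorpus
corpus_from_vector <- VCorpus(vs)
corpus_from_df <- VCorpus(ds)

length(corpus_from_vector)
length(corpus_from_df)
```

DirSource and URISource follow the same pattern but read from a directory or a URI instead of an in-memory object.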

A simple example:

Say you have a character vector -

input <- c('This is line one.', 'And this is the second one')

Create the source - vecSource <- VectorSource(input)

Then create the corpus - VCorpus(vecSource)
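Put together as one script, the steps above might look like the sketch below. The `corpus` variable name and the inspection calls at the end are additions for illustration; they are not part of the original steps.

```r
library(tm)

# Two short documents held in a plain character vector
input <- c("This is line one.", "And this is the second one")

# Wrap the vector in a source, then build a volatile (in-memory) corpus
vecSource <- VectorSource(input)
corpus <- VCorpus(vecSource)

# Number of documents in the corpus
n_docs <- length(corpus)

# Text of the first document
first_text <- content(corpus[[1]])

# Print a summary of the corpus and its documents
inspect(corpus)
```

Each element of the vector ends up as its own document, so `corpus[[1]]` and `corpus[[2]]` can be processed independently.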

Hope this helps. You can read more here - https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

— Indi

5

In practical terms, there is a big difference between Corpus and VCorpus.

Corpus uses SimpleCorpus as its default, which means that some features of VCorpus will not be available. One that is immediately evident is that SimpleCorpus will not let you keep dashes, underscores or other punctuation signs; SimpleCorpus (or Corpus) removes them automatically, while VCorpus does not. Corpus has other limitations, which you can find in the help under ?SimpleCorpus.

Here is an example:

library(tm)

# Read a text file from the internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)

# load the data as a corpus
C.mlk <- Corpus(VectorSource(text))
C.mlk
V.mlk <- VCorpus(VectorSource(text))
V.mlk

The output will be:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 46
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 46

If you inspect the objects:

# inspect the content of the document
inspect(C.mlk[1:2])
inspect(V.mlk[1:2])

You will see that Corpus unpacks the text:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2
[1]                                                                                                                                            
[2] And so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream.


<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2
[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 0
[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 139

While VCorpus keeps it together within the object.

Now let's say you convert both to a document-term matrix:

dtm.C.mlk <- DocumentTermMatrix(C.mlk)
length(dtm.C.mlk$dimnames$Terms)
# 168

dtm.V.mlk <- DocumentTermMatrix(V.mlk)
length(dtm.V.mlk$dimnames$Terms)
# 187

Finally, let's look at the content. This is from the Corpus:

grep("[[:punct:]]", dtm.C.mlk$dimnames$Terms, value = TRUE)
# character(0)

And from the VCorpus:

grep("[[:punct:]]", dtm.V.mlk$dimnames$Terms, value = TRUE)

 [1] "alabama,"       "almighty,"      "brotherhood."   "brothers."     
 [5] "california."    "catholics,"     "character."     "children,"     
 [9] "city,"          "colorado."      "creed:"         "day,"          
[13] "day."           "died,"          "dream."         "equal."        
[17] "exalted,"       "faith,"         "gentiles,"      "georgia,"      
[21] "georgia."       "hamlet,"        "hampshire."     "happens,"      
[25] "hope,"          "hope."          "injustice,"     "justice."      
[29] "last!"          "liberty,"       "low,"           "meaning:"      
[33] "men,"           "mississippi,"   "mississippi."   "mountainside," 
[37] "nation,"        "nullification," "oppression,"    "pennsylvania." 
[41] "plain,"         "pride,"         "racists,"       "ring!"         
[45] "ring,"          "ring."          "self-evident,"  "sing."         
[49] "snow-capped"    "spiritual:"     "straight;"      "tennessee."    
[53] "thee,"          "today!"         "together,"      "together."     
[57] "tomorrow,"      "true."          "york."

Look at the words with punctuation marks attached. That's a huge difference, isn't it?
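If you want to stay with VCorpus but do not want punctuation glued to the terms, one possible approach is to strip it yourself. The sketch below uses two short made-up sentences instead of the downloaded speech, and all variable names are illustrative; it shows two options that tm provides, `tm_map()` with `removePunctuation`, and the `removePunctuation` entry of the `control` list.

```r
library(tm)

# Small stand-in documents (not the speech used above)
docs <- c("I have a dream.", "Let freedom ring, let freedom ring!")
V <- VCorpus(VectorSource(docs))

# Terms from the default VCorpus tokenization: punctuation stays attached
terms_raw <- Terms(DocumentTermMatrix(V))

# Option 1: strip punctuation from the documents before building the matrix
V_clean <- tm_map(V, removePunctuation)
terms_clean <- Terms(DocumentTermMatrix(V_clean))

# Option 2: let DocumentTermMatrix strip punctuation via its control list
terms_ctrl <- Terms(DocumentTermMatrix(V, control = list(removePunctuation = TRUE)))

# Which terms still contain punctuation in each case?
grep("[[:punct:]]", terms_raw, value = TRUE)
grep("[[:punct:]]", terms_clean, value = TRUE)
grep("[[:punct:]]", terms_ctrl, value = TRUE)
```

Option 1 changes the corpus itself, which is useful if later steps should also see cleaned text; option 2 leaves the corpus untouched and only affects the resulting matrix.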

— f0nzie