How can I split a text into sentences?

108

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

python text split

— Artyom

18

Define "sentence".

— martineau

I want to do this, but I want to split wherever there is a period or a newline.

152

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates that this does it:

import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

— Ned Batchelder

3

@Artyom: It can probably work with Russian; see "Can NLTK/pyNLTK work 'per language' (i.e. non-English), and how?".

— 1

4

@Artyom: Here is a direct link to the online documentation for nltk.tokenize.punkt.PunktSentenceTokenizer.

— martineau

10

You might have to execute nltk.download() first and download the models -> punkt

— Martin Thoma

2

This fails on cases with ending quotation marks. If we have a sentence that ends like "this."

— Fosa

1

OK, you convinced me. But I just tested it and it does not seem to fail. My input is

'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.'

and my output is

['This fails on cases with ending quotation marks.',  'If we have a sentence that ends like "this."',  'This is another sentence.']

which looks correct to me.

— szedjani

100

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets= r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

— D Greenberg

19

This is an awesome solution. However, I added two more lines to it: digits = "([0-9])" in the declaration of the regular expressions, and text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) in the function. Now it does not split the line at decimals such as 5.5. Thank you for this answer.

— Ameya Kulkarni
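The decimal fix described in the comment above can be tried in isolation. This is only a minimal sketch, not the full function from the answer: `split_protecting_decimals` is a hypothetical helper name, and it reuses just the `<prd>` and `<stop>` placeholder idea.

```python
import re

digits = r"([0-9])"


def split_protecting_decimals(text):
    """Split on . ? ! but keep dots between two digits (e.g. 5.5) intact."""
    # Hide the dot between two digits behind a placeholder
    text = re.sub(digits + r"[.]" + digits, r"\1<prd>\2", text)
    # Mark every remaining terminator as a sentence boundary
    for mark in ".?!":
        text = text.replace(mark, mark + "<stop>")
    # Restore the protected dots, then cut at the boundaries
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(split_protecting_decimals("The ticket costs 5.5 dollars. That is cheap!"))
```

Because the pattern consumes both digits, chained values such as version numbers (1.2.3) would only be partly protected; a lookaround-based pattern might be a better fit there.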

1

How did you parse the entire Huckleberry Finn? Where is that in text format?

— PascalVKooten

6

A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.","e<prd>g<prd>") and if "i.e." in text: text = text.replace("i.e.","i<prd>e<prd>"), and it fully solved my problem.

— 09:

3

Great solution with very helpful comments! Just to make it a bit more robust, though: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...","<prd><prd><prd>")

— Dascienz
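To see what the extended prefix list and the ellipsis rule from the comment above do, here is a reduced sketch. It is not the full function from the answer; `split_with_extra_rules` is a hypothetical name, and only the two suggested rules are applied before splitting.

```python
import re

# Extended prefix list as suggested in the comment
prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"


def split_with_extra_rules(text):
    """Split on . ? ! while protecting title prefixes and '...' ellipses."""
    # Protect the dot after a known prefix such as "Prof."
    text = re.sub(prefixes, r"\1<prd>", text)
    # Protect each dot of an ellipsis
    text = text.replace("...", "<prd><prd><prd>")
    # Mark the remaining terminators as sentence boundaries
    for mark in ".?!":
        text = text.replace(mark, mark + "<stop>")
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(split_with_extra_rules("Prof. Smith paused... then left. Mt. Fuji is tall."))
```

One consequence of this rule is that an ellipsis never ends a sentence, so text such as "He waited... Nothing happened." stays in one piece.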

1

Can this function be made to treat sentences like this as one sentence: When a child asks her mother "Where do babies come from?", what should one answer her?

— twhale
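One possible way to approach the question above, sketched under the assumption that a "?" or "!" followed by a closing quote and a comma never ends the sentence. The placeholders `<qm>` and `<em>` and the function name are made up for this illustration and are not part of the answer's code.

```python
import re


def split_keeping_quoted_questions(text):
    """Split on . ? ! but not on ?" or !" when a comma follows the quote."""
    # Hide the terminator inside the quotation when the sentence continues
    text = re.sub(r'\?(["”]),', r'<qm>\1,', text)
    text = re.sub(r'!(["”]),', r'<em>\1,', text)
    # Mark the remaining terminators as sentence boundaries
    for mark in ".?!":
        text = text.replace(mark, mark + "<stop>")
    # Restore the hidden terminators
    text = text.replace("<qm>", "?").replace("<em>", "!")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


sample = ('When a child asks her mother "Where do babies come from?", '
          'what should one answer her? Nobody knows.')
print(split_keeping_quoted_questions(sample))
```

In the answer's function, the same substitution would presumably have to run before the lines that move terminators outside the quotation marks.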

50

Instead of using regex for splitting the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Ref: https://stackoverflow.com/a/9474645/2877052

— Hassan Raza

Great, simpler and more reusable example than the accepted answer.

— Jay D.

If you remove the space after a dot, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() works! Hmm...

— Leonid Ganeline

1

for sentence in tokenize.sent_tokenize(text): print(sentence)

— Victoria Stuart

11

You can try using spaCy instead of regex. I use it and it does the job.

import spacy
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())

— Elf

1

spaCy is mega great. But if you only need to pass the text to spaCy ...

— Berlines

@Berlines I agree, but I could not find any other library that does as clean a job as spaCy. If you have any suggestion, I can try it.

— Elf

Also, for AWS Lambda serverless users: spaCy's supporting data files are several hundred MB (the large English model is > 400 MB), so sadly you can't use it out of the box (huge fan of spaCy here).

— Julian H.

9

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'that is', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

— TennisVisuals

1

Perfect approach! The others don't catch ... and ?!.

— Shane Smiskol

6

For simple cases (where sentences are terminated normally), this should work:

import re
with open('somefile.txt') as f:
    text = f.read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change of sentence).

Obviously, this is not the most robust solution, but it will do fine in most cases. The only case it won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)

— Rafe Kettler
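The post-processing idea from the answer above (checking whether each piece starts with a capital letter) could look roughly like the sketch below. The function name `split_and_merge` is hypothetical, and the sketch uses a slightly more conservative test than the answer proposes: a piece is glued back onto the previous one only when it starts with a lowercase letter.

```python
import re


def split_and_merge(text):
    """Split after . ? ! followed by whitespace, then re-join false splits."""
    # Split at whitespace that directly follows a terminator
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    sentences = []
    for part in parts:
        if sentences and part and part[0].islower():
            # A lowercase start suggests the previous split was caused
            # by an abbreviation, so merge the piece back
            sentences[-1] += " " + part
        elif part:
            sentences.append(part)
    return sentences


print(split_and_merge("We sell fruit, e.g. apples and pears. Prices vary."))
```

This heuristic does not help when the word after the abbreviation is capitalized, as in "Dr. Adams", which is still split after "Dr.".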

29

You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?)

— Ned Batchelder

@Ned wow, I can't believe I was that stupid. I must be drunk or something.

— Rafe Kettler

I am using Python 2.7.2 on Win 7 x86, and the regex in the above code gives me this error: SyntaxError: EOL while scanning string literal, pointing to the closing parenthesis (after text). Also, the regex you reference in your text does not exist in your code sample.

— Sabuncu

1

The regex is not completely correct; it should be r' *[\.\?!][\'"\)\]]* +'

— fsociety

This can cause many problems and also chunk a sentence into smaller pieces. Consider the case "I paid $3.5 for this ice cream": the chunks are "I paid $3" and "5 for this ice cream". Use the default nltk sentence tokenizer; it is safer!

— Reihan_amn
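The difference between the two patterns discussed in these comments can be checked directly with `re.split`. This is a small illustration only; the sample sentence is adapted from the comment above, and the variable names `loose` and `strict` are chosen for this sketch.

```python
import re

text = "I paid $3.5 for this ice cream. It was worth it."

# Pattern from the answer: zero or more spaces allowed after the terminator
loose = re.split(r' *[\.\?!][\'"\)\]]* *', text)

# Corrected pattern from the comment: at least one space required after it
strict = re.split(r' *[\.\?!][\'"\)\]]* +', text)

print(loose)   # the decimal point in $3.5 is treated as a boundary
print(strict)  # $3.5 stays intact
```

With the stricter pattern, a terminator at the very end of the text is not followed by a space, so the last sentence keeps its final punctuation while the earlier ones lose theirs.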

6

You can use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

— amiref

2

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

And then call it this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Good luck, Marilena.

— Marilena Di Bari

0

No doubt NLTK is the most suitable tool for the purpose. But getting started with NLTK is quite painful (although once you have installed it, you just reap the rewards).

So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions
import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question

— vaichidrewar

3

Yes, but this fails so easily with: "Mr. Smith knows this is a sentence."

— Thomas

0

I had to read subtitle files and split them into sentences. After pre-processing (such as removing the time information etc. in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below neatly split it into sentences. I was probably lucky that the sentences always ended (correctly) with a space. Try this first and, if it has any exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences: break after "?", "." and "!"
# when followed by a space. fullFile holds the pre-processed subtitle text.
fullFile = re.sub(r"(\!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
# The with-block closes the output file when done
with open("./sentences.out", "w+") as sentFile:
    for line in sentences:
        sentFile.write(line)
        sentFile.write("\n")

Oh well! I now realize that since my content was Spanish, I did not have the issues of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...

— Kishore
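The pre-processing step mentioned in the answer above is not shown there. A minimal sketch of what it might look like follows, assuming the usual .srt layout (a numeric cue index, a timing line containing "-->", then the subtitle text). The function names `srt_to_text` and `split_subtitle_sentences` are hypothetical and not taken from the answer.

```python
import re


def srt_to_text(srt_content):
    """Drop cue numbers and timing lines, join the remaining text lines."""
    kept = []
    for line in srt_content.splitlines():
        line = line.strip()
        # Skip blank lines, cue indexes and "00:00:01,000 --> 00:00:02,500" lines.
        # Caveat: a subtitle line made only of digits is dropped as well.
        if not line or line.isdigit() or "-->" in line:
            continue
        kept.append(line)
    return " ".join(kept)


def split_subtitle_sentences(srt_content):
    """Apply the same crude break-after-terminator rule as the answer."""
    text = srt_to_text(srt_content)
    marked = re.sub(r"([!?.]) ", r"\1<BRK>", text)
    return marked.split("<BRK>")


sample = ("1\n00:00:01,000 --> 00:00:02,500\nHola. ¿Cómo\nestás?\n\n"
          "2\n00:00:03,000 --> 00:00:04,000\nMuy bien!\n")
print(split_subtitle_sentences(sample))
```

Joining the lines with a single space is what lets the break rule work across cue boundaries, since a sentence may continue from one cue to the next.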

0

I hope this will help you with Latin, Chinese, and Arabic text.

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

— mamtimen

0

I was working on a similar task and came across this question. After following a few links and working through a few exercises for nltk, the code below worked like magic for me.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

Output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-ive-ver//

— Mazin Muhammed