This is day 4 of my participation in the Juejin (掘金) Daily New Plan · August Writing Challenge.
When working with text data, we usually need to perform a number of different operations on it, such as appending a new string to the text, splitting the text into multiple strings, or changing the capitalization of letters. Beyond these, we sometimes need more advanced text parsing or other methods, but splitting text into sentences or words and deleting or replacing certain words are the most common operations.
Next, we introduce the common basic string operations through some examples. First we define a piece of text and split it, then make some typical edits, and finally join the edited strings back together.
After defining the input text, split it into individual words. By default, the split() method uses whitespace (spaces, tabs, and newlines) as the delimiter, so the resulting words contain no spaces, newlines, or any other delimiter that was specified:
>>> input_text = 'Never regret falling in love with you. The longer you go, the more you cherish it. If time can flow back to the past, I must make a love song with you again, because you are the only one in my life.'
>>> words = input_text.split()
>>> words
['Never', 'regret', 'falling', 'in', 'love', 'with', 'you.', 'The', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.', 'If', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'I', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.']
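The delimiter behaviour described above can be checked with a small sketch. The sample strings and variable names here are made up for illustration and are not part of the original walkthrough; the sketch only contrasts the default whitespace splitting with an explicit delimiter and the optional maxsplit argument.

```python
# Default split(): any run of whitespace (spaces, tabs, newlines)
# counts as a single delimiter, so no empty strings are produced.
sample = 'one  two\tthree\nfour'
default_split = sample.split()

# An explicit delimiter is matched literally, so two consecutive
# delimiters produce an empty string in the result.
csv_line = 'a,b,,c'
comma_split = csv_line.split(',')

# maxsplit limits how many splits are performed, counting from the left.
limited_split = 'key=value=more'.split('=', 1)

print(default_split)  # ['one', 'two', 'three', 'four']
print(comma_split)    # ['a', 'b', '', 'c']
print(limited_split)  # ['key', 'value=more']
```

Note that punctuation stays attached to the words ('you.', 'go,') because split() only looks at the delimiter, not at word boundaries.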
Replace the uppercase letters that appear in the sentences with the character 'x'. Iterate over each character of each word and, if the character is an uppercase letter, return 'x' in its place. This is done with two nested comprehensions: the outer one runs over the list of words, and the inner one (a generator expression) runs over the characters of each word, using a conditional expression so that only uppercase characters are replaced: 'x' if w.isupper() else w for w in word. The resulting characters are then joined back together with the join() method:
>>> replaced = [''.join('x' if w.isupper() else w for w in word) for word in words]
>>> replaced
['xever', 'regret', 'falling', 'in', 'love', 'with', 'you.', 'xhe', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.', 'xf', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'x', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.']
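For readers who find the nested comprehension dense, the same replacement can be written as explicit loops. This is only an equivalent sketch on a shortened, hypothetical word list (the names replaced_loop and replaced_comp are illustrative); it is not a different technique.

```python
words = ['Never', 'regret', 'If', 'I', 'you.']

# Explicit-loop version of the nested comprehension.
replaced_loop = []
for word in words:
    chars = []
    for ch in word:
        # str.isupper() is True only for uppercase letters; lowercase
        # letters and punctuation pass through unchanged.
        chars.append('x' if ch.isupper() else ch)
    replaced_loop.append(''.join(chars))

# The one-line form used in the walkthrough, for comparison.
replaced_comp = [''.join('x' if ch.isupper() else ch for ch in word)
                 for word in words]

print(replaced_loop)  # ['xever', 'regret', 'xf', 'x', 'you.']
```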
Encode the text, converting it to plain ASCII. This matters in practice: if text is not properly encoded, unexpected errors can occur when it is displayed. Each word is encoded to an ASCII byte sequence and then decoded back to a Python string, and the errors parameter is used during the conversion to force unknown characters to be replaced:
>>> ascii_text = [word.encode('ascii',errors='replace').decode('ascii') for word in replaced]
>>> ascii_text
['xever', 'regret', 'falling', 'in', 'love', 'with', 'you.', 'xhe', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.', 'xf', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'x', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.']
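Because the sample text is already pure ASCII, the output above is identical to the input. The effect of the errors parameter only shows up with non-ASCII characters, as in this small sketch; the word 'café' and the variable names are chosen here for illustration.

```python
word = 'caf\u00e9'  # 'café' with a single precomposed e-acute character

# errors='replace' substitutes '?' for each character ASCII cannot encode.
with_replace = word.encode('ascii', errors='replace').decode('ascii')

# errors='ignore' silently drops such characters instead.
with_ignore = word.encode('ascii', errors='ignore').decode('ascii')

# The default, errors='strict', raises UnicodeEncodeError.
try:
    word.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(with_replace)   # caf?
print(with_ignore)    # caf
print(strict_failed)  # True
```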
Group the words into lines of at most 80 characters each. First, append an extra newline to every word that ends with a period, as a marker that separates the groups. Then create a new line and add the words to it one by one. If adding a word would make the line longer than 80 characters, end the line and start a new one; likewise, start a new line when a newline character is encountered. An extra space is also added to separate the words:
>>> newlines = [word + '\n' if word.endswith('.') else word for word in ascii_text]
>>> newlines
['xever', 'regret', 'falling', 'in', 'love', 'with', 'you.\n', 'xhe', 'longer', 'you', 'go,', 'the', 'more', 'you', 'cherish', 'it.\n', 'xf', 'time', 'can', 'flow', 'back', 'to', 'the', 'past,', 'x', 'must', 'make', 'a', 'love', 'song', 'with', 'you', 'again,', 'because', 'you', 'are', 'the', 'only', 'one', 'in', 'my', 'life.\n']
>>> line_size = 80
>>> lines = []
>>> line = ''
>>> for word in newlines:
...     if line.endswith('\n') or len(line) + len(word) + 1 > line_size:
...         lines.append(line)
...         line = ''
...     line = line + ' ' + word
Finally, format each line as a title (capitalizing the first letter of every word) and concatenate the lines into a single piece of text:
>>> lines = [line.title() for line in lines]
>>> result = ''.join(lines)
>>> print(result)
Xever Regret Falling In Love With You.
Xhe Longer You Go, The More You Cherish It.
Xf Time Can Flow Back To The Past, X Must Make A Love Song With You Again,
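Note that the output stops after "Again,": the loop only appends a line when the next one starts, so the final partial line ("because you are the only one in my life.") is never added to lines. Each stored line also begins with a stray space (from ' ' + word), and a line that was cut because of its length has no trailing newline, so joining with '' would glue it to whatever follows. The following is one possible way to handle these cases, offered as a sketch rather than as the article's method; it skips the uppercase replacement and ASCII steps to stay short.

```python
input_text = ('Never regret falling in love with you. The longer you go, '
              'the more you cherish it. If time can flow back to the past, '
              'I must make a love song with you again, because you are the '
              'only one in my life.')
words = input_text.split()
newlines = [word + '\n' if word.endswith('.') else word for word in words]

line_size = 80
lines = []
line = ''
for word in newlines:
    if line.endswith('\n') or len(line) + len(word) + 1 > line_size:
        lines.append(line)
        line = ''
    # Add the separating space only between words, not at the line start.
    line = line + ' ' + word if line else word
if line:
    # Flush the last, partially filled line that the loop leaves behind.
    lines.append(line)

# Drop the marker newlines and join with '\n' so that lines cut
# because of their length are also separated.
result = '\n'.join(ln.rstrip('\n').title() for ln in lines)
print(result)
```

With these changes the text is printed as four lines, the last one being "Because You Are The Only One In My Life.". For general-purpose wrapping, the standard library's textwrap module (for example textwrap.fill(text, width=80)) is worth considering, although it wraps purely by width and does not break after each sentence.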
Besides the operations above, there are other useful operations you can perform on strings. For example, strings can be sliced just like lists: 'love'[0:3] returns lov. Similar to the title() method, the upper() and lower() methods return the uppercase and lowercase versions of a string, respectively:
>>> print('unicode'[0:3])
uni
>>> print('unicode'.upper())
UNICODE
>>> print('UNicode'.lower())
unicode
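A few related behaviours are worth knowing when using these methods. The sketch below uses a made-up sample string; it shows that title() treats an apostrophe as a word boundary, that capitalize() only uppercases the first character of the whole string, and that slices accept negative indices.

```python
text = "they're unicode strings"

# title() capitalizes after every non-letter, including apostrophes.
titled = text.title()

# capitalize() uppercases only the first character and lowercases the rest.
capitalized = text.capitalize()

# Negative indices count from the end of the string.
last_word = text[-7:]

print(titled)       # They'Re Unicode Strings
print(capitalized)  # They're unicode strings
print(last_word)    # strings
```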