[ Python ] English Text Preprocessing and Useful Regular Expression (re) Notes
2019. 7. 9. 11:37
The Medium article is linked at the bottom.
It is a well-organized post, so if you are curious, go take a look!
I'm still a beginner at text preprocessing and quite clumsy with regular expressions, so while searching for material to get more hands-on practice, I came across this nicely organized resource.
Hangul Unicode code ranges
ㄱ ~ ㅎ: 0x3131 ~ 0x314e
ㅏ ~ ㅣ: 0x314f ~ 0x3163
가 ~ 힣: 0xac00 ~ 0xd7a3
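As a quick sanity check of these ranges, here is a minimal sketch using ord(); the is_hangul helper is just illustrative and not from the original post.
def is_hangul(ch):
    # illustrative helper, not from the original post
    code = ord(ch)
    return (0x3131 <= code <= 0x314e or   # ㄱ ~ ㅎ (consonant jamo)
            0x314f <= code <= 0x3163 or   # ㅏ ~ ㅣ (vowel jamo)
            0xac00 <= code <= 0xd7a3)     # 가 ~ 힣 (syllable blocks)

print(is_hangul('가'), is_hangul('A'))
## True False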
import re

def test():
    s = '韓子는 싫고, 한글은 nice하다. English 쵝오 -_-ㅋㅑㅋㅑ ./?!'
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')  # everything except Hangul and spaces
    # hangul = re.compile('[^ \u3131-\u3163\uac00-\ud7a3]+')  # same as above
    result = hangul.sub('', s)  # remove everything that is not Hangul or a space
    print(result)
    result = hangul.findall(s)  # collect the matching parts into a list
    print(result)

test()
## 는 싫고 한글은 하다 쵝오 ㅋㅑㅋㅑ
## ['韓子', ',', 'nice', '.', 'English', '-_-', './?!']
Source: https://jokergt.tistory.com/52 [Gun's Knowledge Base]
Example: finding and deleting text that is not a letter or a digit
- [^A-Za-z0-9] or \W: matches characters that are not letters or digits (note that \W also treats the underscore as a word character)
import re
string = "(사람)11"
re.sub('[^A-Za-z0-9가-힣]', '', string)
'사람11'
Example: extracting only the Hangul (the pattern below keeps English letters as well)
string = "1-----(사람)!@ 1"
re.sub('[^A-Za-z가-힣]', '', string)
'사람'
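To pull the Hangul out directly instead of deleting everything else, re.findall also works; a small sketch, not from the original post:
import re

string = "1-----(사람)!@ 1"
re.findall('[가-힣]+', string)  # extract runs of Hangul syllables directly
## ['사람']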
Convert text to lowercase
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)
## the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.
Remove numbers (I find this quite useful)
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r'\d+', '', input_str)
print(result)
## Box A contains red and white balls, while Box B contains red and blue balls
Remove punctuation
The following code removes this set of symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (the characters in string.punctuation)
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
translator = str.maketrans('', '', string.punctuation)
result = input_str.translate(translator)
print(result)
This is an example of string with punctuation
Remove whitespaces
input_str = "\t a string example\t "
input_str = input_str.strip()
input_str
## 'a string example'
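Note that strip() only trims leading and trailing whitespace. To also collapse runs of whitespace inside the string, a small sketch, not from the original post:
import re

input_str = "\t a   string \t example\t "
result = re.sub(r'\s+', ' ', input_str).strip()  # collapse internal whitespace, then trim the ends
print(result)
## a string example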
Tokenization
There are apparently several tools for this; the Medium article mainly covers NLTK. The example below tokenizes the sentence and removes English stop words at the same time.
import nltk
#nltk.download('stopwords')
#nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
result = [i for i in tokens if not i in stop_words]
print (result)
## ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
scikit-learn and spaCy also provide their own stop word lists. First, scikit-learn:
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
# in recent scikit-learn versions the import path is sklearn.feature_extraction.text
result = [i for i in tokens if not i in ENGLISH_STOP_WORDS]
print (result)
## ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
from spacy.lang.en.stop_words import STOP_WORDS
result = [i for i in tokens if not i in STOP_WORDS]
print(result)
## ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
Stemming
- Stemming is a process of reducing words to their word stem, base or root form
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print("before : ", word)
    print("after : ", stemmer.stem(word))
    print("=" * 20)
before : There
after : there
====================
before : are
after : are
====================
before : several
after : sever
====================
before : types
after : type
====================
before : of
after : of
====================
before : stemming
after : stem
====================
before : algorithms
after : algorithm
====================
before : .
after : .
====================
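The example sentence mentions that there are several types of stemming algorithms; NLTK also ships SnowballStemmer ("Porter2"), a revised version of the Porter algorithm. A minimal sketch, not from the original post:
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")
for word in ["several", "types", "stemming", "algorithms"]:
    print(word, "->", snowball.stem(word))  # results here are largely the same as PorterStemmer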
Lemmatization
- The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form.
- As opposed to stemming, lemmatization does not simply chop off inflections.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    print("before : ", word)
    print("after : ", lemmatizer.lemmatize(word))
    print("=" * 20)
before : been
after : been
====================
before : had
after : had
====================
before : done
after : done
====================
before : languages
after : language
====================
before : cities
after : city
====================
before : mice
after : mouse
====================
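Note that lemmatize() assumes the noun part of speech by default, which is why been, had, and done come back unchanged above. Passing a part of speech changes the result; a small sketch, not from the original post:
# WordNetLemmatizer defaults to pos='n'; pass pos='v' for verb forms
print(lemmatizer.lemmatize('been', pos='v'))   # be
print(lemmatizer.lemmatize('had', pos='v'))    # have
print(lemmatizer.lemmatize('done', pos='v'))   # do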
Part of speech tagging (POS)
- Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context.
#nltk.download('averaged_perceptron_tagger')
input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)
## [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
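A similar result can be obtained directly with NLTK's pos_tag, which is what the averaged_perceptron_tagger download above is for; a small sketch, not from the original post:
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize(input_str)))  # tags may differ slightly from TextBlob's output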
Chunking (shallow parsing)
- Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings
input_str= "A black television and a white stove were bought for the new apartment of John."
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)
[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]
The second step is chunking:
Looking closely at the result, you can see that DT JJ NN sequences are grouped together into chunks by the rule.
reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)
(S
(NP A/DT black/JJ television/NN)
and/CC
(NP a/DT white/JJ stove/NN)
were/VBD
bought/VBN
for/IN
(NP the/DT new/JJ apartment/NN)
of/IN
John/NNP)
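The parsed result is an nltk.Tree, so the NP chunks can also be pulled out programmatically; a small sketch reusing result from above, not from the original post:
# iterate over just the NP subtrees of the chunked result
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))
## A black television
## a white stove
## the new apartment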
Named entity recognition
- Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk
input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))
(S
(PERSON Bill/NNP)
works/VBZ
for/IN
Apple/NNP
so/IN
he/PRP
went/VBD
to/TO
(GPE Boston/NNP)
for/IN
a/DT
conference/NN
./.)
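For comparison, spaCy also provides named entity recognition, assuming the en_core_web_sm model has been downloaded; a minimal sketch, not from the original post:
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("Bill works for Apple so he went to Boston for a conference.")
for ent in doc.ents:
    print(ent.text, ent.label_)
## typically: Bill PERSON, Apple ORG, Boston GPE (exact labels depend on the model version)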
https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908