[ Python ] English Text Preprocessing and Useful Regular Expression (re) Reference



 

A Medium article is linked at the bottom of this post.

It is a well-organized piece covering a lot of ground, so if you are curious, it is worth reading in full!

 

I am still a beginner at text preprocessing and quite clumsy with regular expressions, so while looking for material to practice on, I came across this well-organized resource.

 

 

Hangul code-point ranges

ㄱ ~ ㅎ: 0x3131 ~ 0x314e

ㅏ ~ ㅣ: 0x314f ~ 0x3163

가 ~ 힣: 0xac00 ~ 0xd7a3

import re

def test():
    s = '韓子는 싫고, 한글은 nice하다. English 쵝오 -_-ㅋㅑㅋㅑ ./?!'
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')  # everything except Hangul and spaces
    # hangul = re.compile('[^ \u3131-\u3163\uac00-\ud7a3]+')  # equivalent, using the code points above
    result = hangul.sub('', s)  # strip everything that is not Hangul or a space
    print(result)
    result = hangul.findall(s)  # collect the matched (non-Hangul) parts as a list
    print(result)

test()


## 는 싫고 한글은 하다  쵝오 ㅋㅑㅋㅑ 
## ['韓子', ',', 'nice', '.', 'English', '-_-', './?!']


Source: https://jokergt.tistory.com/52 [Gun's Knowledge Base]
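
A quick sanity check (my own addition, not in the source post) that the code-point ranges listed above are right, using ord() and hex():

print(hex(ord('ㄱ')), hex(ord('ㅎ')))  # 0x3131 0x314e
print(hex(ord('ㅏ')), hex(ord('ㅣ')))  # 0x314f 0x3163
print(hex(ord('가')), hex(ord('힣')))  # 0xac00 0xd7a3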

 

Example: find and delete text that is not a letter or a digit (the pattern below also keeps Hangul)

  • [^A-Za-z0-9] matches anything that is not a Latin letter or a digit (roughly \W, except that \W does not match the underscore)
import re
string = "(사람)11"
re.sub('[^A-Za-z0-9가-힣]', '', string)   


'사람11'

Example: extract only Hangul (the pattern below also keeps Latin letters)

string = "1-----(사람)!@   1"
re.sub('[^A-Za-z가-힣]', '', string)   
'사람'

 

Convert text to lowercase

input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)
## the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.

 

Remove numbers (I find this one useful)

import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r'\d+', '', input_str)
print(result)
## Box A contains  red and  white balls, while Box B contains  red and  blue balls.

 

Remove punctuation

The following code removes Python's string.punctuation set of symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!" # Sample string
translator = str.maketrans('', '', string.punctuation)
result = input_str.translate(translator)
print(result)

This is an example of string with punctuation
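
A regex alternative, as a sketch of my own rather than part of the article: the class [^\w\s] drops everything that is neither a word character nor whitespace. Unlike string.punctuation, it leaves underscores alone.

import re
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"
result = re.sub(r'[^\w\s]', '', input_str)  # remove anything that is neither a word character nor whitespace
print(result)
## This is an example of string with punctuation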

 

Remove whitespaces

input_str = "\t a string example\t   "
input_str = input_str.strip()
input_str
## 'a string example'
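
strip() only trims leading and trailing whitespace. If runs of spaces and tabs inside the string should also be collapsed, here is a small sketch (assuming a single space is the desired separator):

import re
input_str = "\t a   string \t  example\t   "
result = re.sub(r'\s+', ' ', input_str).strip()  # collapse each whitespace run to one space, then trim the ends
print(result)
## a string example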

Tokenization

There are several tools for this; the Medium article mainly uses NLTK. Note that the example further below combines tokenization with stop-word removal.
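
Before that, a tokenization-only sketch with NLTK (sentence level and word level); the punkt model is the only download it needs:

import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
text = "NLTK is a leading platform. It works with human language data."
print(sent_tokenize(text))  # ['NLTK is a leading platform.', 'It works with human language data.']
print(word_tokenize(text))  # ['NLTK', 'is', 'a', 'leading', 'platform', '.', 'It', 'works', 'with', 'human', 'language', 'data', '.']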

 

import nltk  
#nltk.download('stopwords')
#nltk.download('punkt')
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

## ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']

scikit-learn and spaCy also provide their own English stop-word lists:

from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
# note: newer scikit-learn releases moved this constant to sklearn.feature_extraction.text
result = [i for i in tokens if i not in ENGLISH_STOP_WORDS]
print(result)

## ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']

from spacy.lang.en.stop_words import STOP_WORDS
result = [i for i in tokens if i not in STOP_WORDS]
print(result)
## ['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']

 

Stemming

  • Stemming is a process of reducing words to their word stem, base or root form

 

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
tokens = word_tokenize(input_str)
for word in tokens:
    print("before : ", word)
    print("after : ", stemmer.stem(word))
    print("=" * 20)

before :  There
after :  there
====================
before :  are
after :  are
====================
before :  several
after :  sever
====================
before :  types
after :  type
====================
before :  of
after :  of
====================
before :  stemming
after :  stem
====================
before :  algorithms
after :  algorithm
====================
before :  .
after :  .
====================
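
The example sentence itself notes that several stemming algorithms exist. A small comparison sketch (not in the original post) of NLTK's Porter and Lancaster stemmers; Lancaster is generally the more aggressive of the two:

from nltk.stem import PorterStemmer, LancasterStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["several", "stemming", "algorithms", "maximum"]:
    # print both stems side by side; the two algorithms often disagree
    print(word, porter.stem(word), lancaster.stem(word))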

Lemmatization

  • The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form.
  • As opposed to stemming, lemmatization does not simply chop off inflections.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
tokens = word_tokenize(input_str)
for word in tokens:
    print("before : ", word)
    print("after : ", lemmatizer.lemmatize(word))
    print("=" * 20)

before :  been
after :  been
====================
before :  had
after :  had
====================
before :  done
after :  done
====================
before :  languages
after :  language
====================
before :  cities
after :  city
====================
before :  mice
after :  mouse
====================
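
Notice that 'been', 'had', and 'done' came back unchanged above: WordNetLemmatizer treats every word as a noun unless a part of speech is passed. Supplying pos='v' handles the verb forms:

print(lemmatizer.lemmatize("been", pos='v'))   # be
print(lemmatizer.lemmatize("had", pos='v'))    # have
print(lemmatizer.lemmatize("done", pos='v'))   # do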

Part of speech tagging (POS)

  • Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context.
#nltk.download('averaged_perceptron_tagger')
input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)
## [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
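
The commented-out nltk.download('averaged_perceptron_tagger') above is the model that nltk.pos_tag relies on, so the same kind of tagging can also be done with NLTK directly; note that, unlike the TextBlob result, the punctuation tokens get tags too:

from nltk import word_tokenize, pos_tag
# nltk.download('averaged_perceptron_tagger')
print(pos_tag(word_tokenize(input_str)))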

 

Chunking (shallow parsing)

  • Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings
input_str= "A black television and a white stove were bought for the new apartment of John."
from textblob import TextBlob
result = TextBlob(input_str)
print(result.tags)

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

The second step is chunking:

Looking at the output below, you can see that the rule groups a DT JJ NN sequence into a single NP chunk.

reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)


(S
  (NP A/DT black/JJ television/NN)
  and/CC
  (NP a/DT white/JJ stove/NN)
  were/VBD
  bought/VBN
  for/IN
  (NP the/DT new/JJ apartment/NN)
  of/IN
  John/NNP)
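
The noun phrases can be pulled back out of the tree; a small sketch of my own, where result is the parse tree printed above:

for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    # each NP subtree's leaves are (word, tag) pairs
    print(" ".join(word for word, tag in subtree.leaves()))

## A black television
## a white stove
## the new apartment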

Named entity recognition

  • Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk
input_str = "Bill works for Apple so he went to Boston for a conference."
print(ne_chunk(pos_tag(word_tokenize(input_str))))


(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  Apple/NNP
  so/IN
  he/PRP
  went/VBD
  to/TO
  (GPE Boston/NNP)
  for/IN
  a/DT
  conference/NN
  ./.)
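
The recognized entities can likewise be collected from the tree as (label, text) pairs; a sketch of my own:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

tree = ne_chunk(pos_tag(word_tokenize(input_str)))
# chunked nodes are subtrees; plain tokens stay as (word, tag) tuples
entities = [(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
            for subtree in tree if isinstance(subtree, Tree)]
print(entities)
## [('PERSON', 'Bill'), ('GPE', 'Boston')]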

 

 

https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

 

Text Preprocessing in Python: Steps, Tools, and Examples, by Olga Davydova, Data Monsters (medium.com)

 

 
