Python_WordCloud 2

DATA_STUDY 2024. 1. 22. 18:48

Loading the data

# Sample text data
text1 = """Natural language processing (NLP) is a field
of computer science, artificial intelligence,
and computational linguistics concerned with
the interactions between computers and human
(natural) languages."""

text2 = """자연어 처리(Natural Language Processing, NLP)는
인간의 언어 현상을 컴퓨터와 같은 기계를 이용하여
모사할 수 있도록 하는 인공 지능의 하위 분야 중 하나입니다."""

Korean language setup

!pip install konlpy
!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Data preprocessing

# Download the NLTK data (only needs to be done once)
# nltk.download('stopwords')
# nltk.download('wordnet')

1) Tokenization

# English
import nltk

# Download the tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

# Tokenize the text
tokens1 = nltk.tokenize.word_tokenize(text1)
print(tokens1)

# Korean

from konlpy.tag import Okt

# Use the Okt morphological analyzer from KoNLPy
okt = Okt()

# Tokenize the text into morphemes
tokens2 = okt.morphs(text2)
print(tokens2)
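
To see why a cleaning step is needed after tokenization, it helps to look at the shape of the token list. The snippet below is a rough standard-library approximation, not the algorithm NLTK or Okt uses: it splits a hypothetical sample sentence into word runs and single punctuation marks with a regular expression. The names `sample` and `rough_tokens` are assumptions made for illustration.

```python
import re

# Hypothetical sample sentence, taken from the end of text1.
sample = "computers and human (natural) languages."

# Rough approximation only: runs of word characters, or single
# punctuation characters. This is NOT how nltk.word_tokenize works internally.
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(rough_tokens)
```

Punctuation such as the parentheses and the final period comes out as separate tokens, which is what the cleaning step filters out.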

2) Cleaning & normalization

# Cleaning and normalization: lowercase the tokens and keep only purely alphanumeric ones

cleaned_tokens1 = [token.lower() for token in tokens1 if token.isalnum()]
print(cleaned_tokens1)

cleaned_tokens2 = [token for token in tokens2 if token.isalnum()]
print(cleaned_tokens2)
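
The `isalnum()` filter keeps a token only when every character in it is a letter or a digit. A small sketch with hypothetical sample tokens (not taken from the texts above) shows the effect: Hangul tokens pass the check, punctuation is dropped, and so are hyphenated words, which may or may not be what you want.

```python
# Hypothetical sample tokens, chosen only to illustrate str.isalnum().
sample_tokens = ["NLP", "(", "자연어", "state-of-the-art", "2024", "."]

# Same filter as in the post: keep tokens made up entirely of letters/digits.
kept = [token.lower() for token in sample_tokens if token.isalnum()]
dropped = [token for token in sample_tokens if not token.isalnum()]

print(kept)
print(dropped)
```

If hyphenated terms matter for your word cloud, a looser condition than `isalnum()` would be needed.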

# Stemming
stemmer1 = nltk.stem.PorterStemmer()
stemmed_tokens1 = [stemmer1.stem(token) for token in cleaned_tokens1]
print(stemmed_tokens1)

# Lemmatization
nltk.download('wordnet')
lemmatizer1 = nltk.stem.WordNetLemmatizer()
lemmatized_tokens1 = [lemmatizer1.lemmatize(token) for token in cleaned_tokens1]
print(lemmatized_tokens1)

3) Stopword removal

# Remove stopwords

nltk.download('stopwords')
stop_words1 = set(nltk.corpus.stopwords.words('english'))
filtered_tokens1 = [token for token in lemmatized_tokens1 if token not in stop_words1]
print(filtered_tokens1)

# Minimal hand-written stopword list of common Korean particles
stop_words2 = set(["은", "는", "이", "가", "을", "를"])
filtered_tokens2 = [token for token in cleaned_tokens2 if token not in stop_words2]
print(filtered_tokens2)
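
A word cloud sizes each word by how often it occurs, so it can be useful to inspect the counts before drawing anything. The following is a minimal sketch using `collections.Counter`; the list `sample_filtered` is a hypothetical stand-in for `filtered_tokens1`, not the real output of the steps above.

```python
from collections import Counter

# Hypothetical token list standing in for filtered_tokens1.
sample_filtered = ["language", "computer", "language", "field", "computer", "language"]

# Count occurrences of each token.
frequencies = Counter(sample_filtered)

# The two most frequent words, as (word, count) pairs.
top_words = frequencies.most_common(2)
print(top_words)
```

A mapping like `frequencies` can also be handed to `WordCloud.generate_from_frequencies()` instead of joining the tokens back into one string.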

Word cloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Create the word cloud (font_path points at the Nanum font installed above)
wordcloud1 = WordCloud(
    width=800, height=400, background_color='white',
    font_path='/usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf',
)

wordcloud1.generate(' '.join(filtered_tokens1))
wordcloud1.to_file("wordcloud1.png")

# Display the word cloud
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()

# Create the word cloud for the Korean text (a Korean font is required here)
wordcloud2 = WordCloud(
    width=800, height=400, background_color='white',
    font_path='/usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf',
)

wordcloud2.generate(' '.join(filtered_tokens2))
# Save under a separate name so the English image is not overwritten
wordcloud2.to_file("wordcloud2.png")

# Display the word cloud
plt.imshow(wordcloud2)
plt.axis('off')
plt.show()

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Display the word cloud again, with bilinear interpolation for smoother rendering
plt.imshow(wordcloud2, interpolation='bilinear')
plt.axis('off')
plt.show()
