Natural Language Processing Notes

Part - 1

Popular Packages:

pip install nltk
pip install spacy
pip install textblob
pip install stanza          # Python interface to Stanford CoreNLP (CoreNLP itself is a Java toolkit)

Data Sources

API
PDF
Word documents
JSON
HTML
web scraping

Twitter data can be extracted using the package tweepy

pip install tweepy
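
A minimal sketch, assuming Tweepy v4 and a valid API bearer token (the token below is a placeholder):

import tweepy

client = tweepy.Client(bearer_token="BEARER_TOKEN")   # placeholder credential

# Fetch recent tweets matching a search query
response = client.search_recent_tweets(query="natural language processing", max_results=10)
for tweet in response.data:
    print(tweet.text)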

PDF file data can be processed with the PyPDF2 library.

pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader   # renamed to PdfReader in PyPDF2 3.x
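
A minimal reading sketch, assuming PyPDF2 3.x (where PdfFileReader became PdfReader) and a local file sample.pdf:

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")           # hypothetical file path
text = ""
for page in reader.pages:                  # iterate over every page
    text += page.extract_text() or ""      # extract_text() may return None for image-only pages
print(text[:200])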

Word files can be read with the python-docx package (imported as docx).

pip install python-docx     # the module is imported as docx
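
A minimal sketch, assuming python-docx is installed and a local file sample.docx exists:

from docx import Document                  # import name is docx, package name is python-docx

doc = Document("sample.docx")              # hypothetical file path
text = "\n".join(p.text for p in doc.paragraphs)   # join the text of all paragraphs
print(text[:200])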

Tokenizing Words with Regular Expressions

import re

re.split(r'\s+', 'hello world')   # ['hello', 'world'] -- use a raw string for regex patterns
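
Besides splitting on whitespace, re.findall can pull out word tokens directly; a quick sketch:

import re

# Keep alphanumeric runs only, dropping punctuation
re.findall(r'\w+', "Hello, world! NLP is fun.")   # ['Hello', 'world', 'NLP', 'is', 'fun']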

Part - 2

Exploring and processing text data

Steps:

lowercasing
punctuation removal
stop words removal
text standardization
spelling correction
tokenization
stemming
lemmatization
EDA (exploratory data analysis)
end to end pipeline processing

Converting text data to lowercase

Using Python's built-in str.lower() method.
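
A quick example:

"Natural Language Processing".lower()   # 'natural language processing'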

Removing Punctuation

Punctuation can be removed with either a regex or the str.replace() method; the full set of punctuation characters is available as string.punctuation in the string module.
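
A small sketch showing both approaches mentioned above:

import re
import string

text = "Hello, world!!! NLP-basics."
re.sub('[%s]' % re.escape(string.punctuation), '', text)    # regex approach -> 'Hello world NLPbasics'
text.translate(str.maketrans('', '', string.punctuation))   # str.translate approach, same result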

Removing Stop Words

The simplest way to do this is to use the NLTK library's built-in stop word lists.

import nltk
nltk.download('stopwords')   # download just the stopwords corpus (plain nltk.download() opens the interactive downloader)
from nltk.corpus import stopwords
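
A small sketch that filters stop words out of a token list (assumes the stopwords corpus has been downloaded as above):

stop_words = set(stopwords.words('english'))
tokens = "this is a simple example of stop word removal".split()
[t for t in tokens if t not in stop_words]   # ['simple', 'example', 'stop', 'word', 'removal']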

Standardizing Text

Converting short forms (abbreviations, slang) into their full, syntactically correct equivalents.
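
A minimal sketch using a hypothetical lookup dictionary of short forms (real projects use much larger mappings):

# Hypothetical short-form -> standard-form mapping
lookup = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks"}

text = "thx u r gr8"
" ".join(lookup.get(word, word) for word in text.split())   # 'thanks you are great'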

Correcting Spelling

Using TextBlob's correct() method.
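
A quick sketch, assuming TextBlob and its corpora are installed:

from textblob import TextBlob

TextBlob("I havv goood speling!").correct()   # roughly: TextBlob("I have good spelling!")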

Tokenizing Text

Tokenization can be easily performed with libraries like NLTK, spaCy, and TextBlob.
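
An NLTK sketch (the punkt tokenizer models must be downloaded once; newer NLTK releases may ask for punkt_tab instead):

import nltk
nltk.download('punkt')                      # one-time download of tokenizer models
from nltk.tokenize import word_tokenize

word_tokenize("NLP with Python is fun!")    # ['NLP', 'with', 'Python', 'is', 'fun', '!']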

Stemming

The process of reducing words to their root form by stripping affixes; the result is not always a valid dictionary word.

from nltk.stem import PorterStemmer
st = PorterStemmer()
st.stem('running')   # 'run'
st.stem('flies')     # 'fli' -- stems need not be valid words

Lemmatizing

Extracting the root word (lemma) as it appears in the vocabulary, so the result is always a valid dictionary word.

Available libraries (see the sketch after this list):

NLTK
TextBlob
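
A short NLTK sketch, assuming the WordNet data has been downloaded:

import nltk
nltk.download('wordnet')                 # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemm = WordNetLemmatizer()
lemm.lemmatize('leaves')                 # 'leaf'
lemm.lemmatize('running', pos='v')       # 'run'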

Part - 3