Recently, I implemented a Kaggle notebook that classifies YouTube videos into their keywords (tags) based on the video titles. In this post, I explain how I did that and share the code with some explanations.
The dataset is YouTube video metadata hosted on Kaggle:
https://www.kaggle.com/code/oladazimi/video-keyword-generator-acc-79/data?select=videos-stats.csv
For this post, we only need the Title and Keyword columns from the above CSV. So let’s load the data and pull out these two columns:
import pandas as pd
video_stats_data = pd.read_csv("/kaggle/input/youtube-statistics/videos-stats.csv")
videos_titles = list(video_stats_data["Title"].values)
keywords = list(video_stats_data["Keyword"].values)
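Before going further, it helps to glance at the two columns we care about (output omitted here; this is just a sanity check):
print(video_stats_data[["Title", "Keyword"]].head())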
The Title column is a mixture of the video title and the content provider name, separated by the | character. Since we only need the actual title, we extract it from the string:
titles = [v_t.split('|')[0] for v_t in videos_titles]
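For example, on a hypothetical raw title (not from the dataset), the split looks like this; the leftover trailing whitespace is removed later during pre-processing:
raw_title = "How To Produce A Video | Some Channel"  # hypothetical example
print(raw_title.split('|')[0])  # prints "How To Produce A Video " (note the trailing space)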
Data pre-processing
Before the actual classification, we need to process the string data with some Natural Language Processing (NLP) techniques.
The first step is to download the English stopwords. Stopwords are words that carry little meaning on their own. For instance, consider the text “This is a book”: the words this, is, and a are stopwords, so if we remove them the sentence is essentially about “book”. Here is the code:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')
Note: Which stopwords to remove is very problem-dependent. For example, the word “not” is a stopword by default, but it can be important in some contexts since it negates the sentence.
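As a quick check, “not” really is in the NLTK list; if negation matters for your task, you could drop it before filtering (a hypothetical tweak, not applied in this notebook):
print('not' in eng_stopwords)  # True
eng_stopwords = [w for w in eng_stopwords if w != 'not']  # keep negations (hypothetical)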
The next step is to clean the data. The cleaning process includes:
- Remove all non-alphabetic characters
- Make all words lowercase
- Remove stopwords
- Stemming
Note: These steps are very problem-related and depend on your goal.
To build our corpus, we loop over all video titles, clean each one, and append the result to a list:
import re
from nltk.stem.porter import PorterStemmer
corpus = []
for vt in titles:
    cleaned_title = re.sub('[^a-zA-Z]', ' ', vt)
    cleaned_title = cleaned_title.lower()
    cleaned_title = cleaned_title.split()
    stemmer = PorterStemmer()
    cleaned_title = [stemmer.stem(token) for token in cleaned_title if token not in set(eng_stopwords)]
    corpus.append(' '.join(cleaned_title))
Here is an explanation of each line in the body of the for loop (see the worked example after this list):
- Line1: Remove all non-alphabetic characters with the help of Python’s regular expression library (re), replacing them with spaces.
- Line2: Make all characters lowercase.
- Line3: Split the title string on whitespace. This tokenizes each title; for example, “This is a book” becomes [“this”, “is”, “a”, “book”].
- Line4: Prepare for stemming using the nltk PorterStemmer class. Stemming reduces words with the same root to a single form. For instance, “produce”, “production”, and “producing” all become “produc” after stemming. This matters because these words carry the same meaning in different forms.
- Line5: This list comprehension performs the stemming and the stopword removal in a single pass.
- Line6: Finally, we join the tokens back into a string (the reverse of Line3) and append it to the corpus list.
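To see these steps in action, here is what one pass of the loop produces for a single hypothetical title:
sample = "Producing Videos: Tips & Tricks"  # hypothetical title
stemmer = PorterStemmer()
tokens = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
# tokens -> ['producing', 'videos', 'tips', 'tricks']
print(' '.join(stemmer.stem(t) for t in tokens if t not in set(eng_stopwords)))
# prints "produc video tip trick"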
Now that our clean corpus is ready, we can move on to the next step, which is modeling.
Bag Of Words
Until now, our corpus is a list of strings. However, to feed a machine learning model, we need to convert the strings into numeric values, or in other words, vectorize them. To do this we use bag-of-words vectorization. In this approach, we build a vocabulary of all the words in our corpus (all video titles) and then represent each title as a vector of word counts over that vocabulary. For example, let’s say our vocabulary is:
[“video”, “table”, “book”, “pen”, “produce”]
Now let’s say our title is “book produce book”. The numeric vector for this title based on the vocabulary would be: [0, 0, 2, 0, 1]
(two occurrences of “book” and one of “produce”, counted at their positions in the vocabulary)
To do this, we use CountVectorizer from sklearn in Python:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)  # keep only the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()
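To make the vectorization concrete, here is a toy run on hypothetical data (not from the notebook). Note that CountVectorizer sorts its vocabulary alphabetically, so the column order differs from the hand-made example above; get_feature_names_out requires scikit-learn 1.0+ (older versions use get_feature_names):
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(["video table book pen produce", "book produce book"]).toarray()
print(toy_cv.get_feature_names_out())  # ['book' 'pen' 'produce' 'table' 'video']
print(toy_X[1])  # [2 0 1 0 0] -- two "book", one "produce"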
We also need to encode our labels (keywords) as numbers:
from sklearn.preprocessing import LabelEncoder
lbe = LabelEncoder()
y = lbe.fit_transform(keywords)
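The fitted encoder can also map numbers back to keyword strings, which is handy when inspecting predictions later:
print(lbe.classes_[:5])              # the first few keyword names, in encoded order
print(lbe.inverse_transform(y[:5]))  # the original keywords of the first five videos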
Classification
Now that we have everything ready, we can split our data into train and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
For the classification, I used Logistic Regression, since it gave a better result than the other models I tried.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# One-vs-rest scheme with the liblinear solver, a solid choice for sparse bag-of-words features
lr_reg = LogisticRegression(multi_class='ovr', solver='liblinear')
lr_reg.fit(X_train, y_train)
y_pred = lr_reg.predict(X_test)
acc = accuracy_score(y_test, y_pred)
The final accuracy is 79%.
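Overall accuracy can hide weak classes, so a per-keyword breakdown is worth a look. This sketch is not part of the original notebook and assumes scikit-learn 0.22+ for the zero_division argument:
from sklearn.metrics import classification_report
import numpy as np
# Restrict the report to labels that actually appear in the test split or predictions
labels_present = np.unique(np.concatenate([y_test, y_pred]))
print(classification_report(y_test, y_pred, labels=labels_present,
                            target_names=lbe.classes_[labels_present], zero_division=0))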
In the end, we run a CAP (Cumulative Accuracy Profile) analysis: we sort the test samples by the model’s output, accumulate the true labels, and check how much of the total is captured within the first 50% of the samples:
import numpy as np
# Sort the true labels by the model's predictions, in descending order
model_y = [y for _, y in sorted(zip(y_pred, y_test), reverse=True)]
# Cumulative sum of the sorted labels, with a leading 0
nb_y = np.append([0], np.cumsum(model_y))
# Index corresponding to 50% of the test observations
half_x = int(50 * len(y_test) / 100)
# Share (in percent) of the total captured within the first half
cap = nb_y[half_x] * 100 / max(nb_y)
The CAP result is around 71%.
Here is the link to the full notebook code:
https://www.kaggle.com/code/oladazimi/video-keyword-generator-acc-79
The End.