Topic Categorization - Jupyter Notebook


This is the static (HTML) version of the notebook for the tutorial “Topic Categorization - Classic Workflow”.


Before we begin, let’s import the needed modules…

from pyss3 import SS3
from pyss3.util import Dataset
from pyss3.server import Live_Test

from sklearn.metrics import accuracy_score

… and unzip the “topic.zip” dataset inside the datasets folder.

!unzip -u datasets/topic.zip -d datasets/

Ok, now we are ready to begin. Let’s create a new SS3 instance

clf = SS3()
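
(In case you already know which hyperparameter values you want, the SS3 constructor also accepts them directly; a minimal sketch, assuming the same s, l, and p keyword arguments we use later in this tutorial:)

# clf = SS3(s=0.45, l=0.5, p=1)  # explicitly passing the (default) hyperparameter values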

What are the default hyperparameter values? Let’s see…

s, l, p, _ = clf.get_hyperparameters()

print("Smoothness(s):", s)
print("Significance(l):", l)
print("Sanction(p):", p)
Smoothness(s): 0.45
Significance(l): 0.5
Sanction(p): 1

Ok, now let’s load the training and test sets using the Dataset.load_from_files function from pyss3.util. Since this dataset stores each category in a single file, we will use the argument folder_label=False to tell PySS3 to treat each file as a different category and each line inside it as a different document:

x_train, y_train = Dataset.load_from_files("datasets/topic/train", folder_label=False)
x_test, y_test = Dataset.load_from_files("datasets/topic/test", folder_label=False)
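
(Optionally, since x_train/x_test are simply lists of documents and y_train/y_test the lists of their labels, a quick sanity check like the following should work:)

print("Training documents:", len(x_train))
print("Test documents:", len(x_test))
print("Categories:", sorted(set(y_train)))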

Let’s train our model…

clf.fit(x_train, y_train)
Training: 100%|██████████| 8/8 [00:29<00:00,  3.70s/it]

Note that we didn’t have to create any document-term matrix! We are using just the plain x_train documents :D Cool, huh? (SS3 builds a language model for each category, so it doesn’t need any document-term matrices.)
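
For contrast, here is a rough sketch of the extra vectorization step that a traditional bag-of-words pipeline would require (using scikit-learn names, shown only for illustration; this is not part of the SS3 workflow):

# NOT needed with SS3 -- a typical classifier would first require this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = CountVectorizer()
dtm_train = vec.fit_transform(x_train)   # explicit document-term matrix
MultinomialNB().fit(dtm_train, y_train)  # the classifier consumes the matrix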

Now that the model has been trained, let’s test it using the documents in x_test

y_pred = clf.predict(x_test)
Classification: 100%|██████████| 800/800 [00:01<00:00, 779.66it/s]

Let’s see how well our model performed

print("Accuracy:", accuracy_score(y_pred, y_test))
Accuracy: 0.70375
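
(If you also want a per-category breakdown, which the original notebook doesn’t show, scikit-learn’s classification_report gives one:)

from sklearn.metrics import classification_report

# precision, recall, and F1 score for each of the 8 categories
print(classification_report(y_test, y_pred))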

Not bad using the default hyperparameter values… Let’s manually analyze what this model has actually learned by using the interactive “live test”. Note that since we are not going to use x_test for this live test (*) but instead the documents in "datasets/topic/live_test", we must use the set_testset_from_files method to tell the server to load documents from there instead.

(*) Try it if you want, but since x_test contains (preprocessed) tweets, the documents don’t look very clean.

# Live_Test.run(clf, x_test, y_test)  # <- this visualization doesn't look very clean so, instead,
                                      #    we will use the documents in the "live_test" folder:

Live_Test.set_testset_from_files("datasets/topic/live_test")

Live_Test.run(clf)

The live test doesn’t look bad; however, we will now create a “more intelligent” version of this model, one that can recognize variable-length word n-grams “on the fly”. Thus, when calling fit we will pass an extra argument, n_grams=3, to indicate that we want SS3 to learn to recognize important words, bigrams, and 3-grams (*). Additionally, we will name our model “topic_categorization_3grams” so that we can save it and later load it from the PySS3 Command Line to perform hyperparameter optimization and find better hyperparameter values.

(*) If you’re curious and want to know how this is actually done by SS3, read the paper “t-SS3: a text classifier with dynamic n-grams for early risk detection over text streams” (preprint available here).

clf = SS3(name="topic_categorization_3grams")

clf.fit(x_train, y_train, n_grams=3)  # <-- note the n_grams=3 argument here
Training: 100%|██████████| 8/8 [00:37<00:00,  4.64s/it]

As mentioned above, we will save this trained model for later use

clf.save_model()
[ saving model (ss3_models/topic_categorization_3grams.ss3m)... ]
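
(To reload this trained model in a later session, the usual PySS3 pattern is to create an instance with the same name and call load_model; a minimal sketch, assuming the model file stays in the default ss3_models folder:)

clf = SS3(name="topic_categorization_3grams")
clf.load_model()  # loads ss3_models/topic_categorization_3grams.ss3m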

Now let’s see if the performance has improved…

y_pred = clf.predict(x_test)
Classification: 100%|██████████| 800/800 [00:01<00:00, 734.93it/s]
print("Accuracy:", accuracy_score(y_pred, y_test))
Accuracy: 0.71875

Yes, the accuracy improved slightly, but more importantly, when using the interactive “live test” we should now see that the model has learned “more intelligent” patterns involving sequences of words (like “machine learning”, “artificial intelligence”, and “self-driving cars” for the “science&technology” category). Let’s see…

Live_Test.run(clf)

Fortunately, our model has learned to recognize these important sequences (such as “artificial intelligence” and “machine learning” in doc_2.txt, “self-driving cars” in doc_6.txt, etc.). However, some documents aren’t perfectly classified; for instance, doc_3.txt was classified as “science&technology” (as a third topic), which is clearly wrong…

So, one last thing we are going to do is try to find better hyperparameter values to improve our model’s performance. To achieve this, we will perform what is known as “hyperparameter optimization” using the PySS3 Command Line tool.

At this point you should read the Hyperparameter Optimization section of this tutorial.
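
(If you’d rather stay inside the notebook, a coarse grid search can also be sketched by hand, since SS3’s hyperparameters affect classification, not training, and can therefore be changed without refitting. The candidate values below are made up for illustration, and a proper search would evaluate on a held-out validation set rather than on the test set:)

# hand-rolled (and much coarser) alternative to the PySS3 Command Line tool
best_acc, best_slp = 0, None
for s in (0.3, 0.45):              # illustrative candidate values only
    for l in (0.5, 1.0, 1.5):
        for p in (0.5, 1.0, 1.5):
            clf.set_hyperparameters(s=s, l=l, p=p)
            acc = accuracy_score(y_test, clf.predict(x_test))
            if acc > best_acc:
                best_acc, best_slp = acc, (s, l, p)

print("Best accuracy:", best_acc, "with (s, l, p) =", best_slp)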

As described in the “Hyperparameter Optimization” section, we found that the following hyperparameter values should improve our classification performance:

clf.set_hyperparameters(s=0.32, l=1.24, p=1.1)

Let’s see if it’s true…

y_pred = clf.predict(x_test)
Classification: 100%|██████████| 800/800 [00:09<00:00, 88.64it/s]
print("Accuracy:", accuracy_score(y_pred, y_test))
Accuracy: 0.77125

The accuracy has improved as expected :)

Let’s perform the last check and visualize what our final model has learned and how it is classifying the documents…

Live_Test.run(clf)

Perfect! Now the documents are classified properly (including doc_3.txt)! :D

…and that’s it, nicely done buddy!