In this notebook, we will see how we can use the PySS3 Python package to ask the text classifier not only to classify a document but also to give us the list of text fragments its classification decision was based on.

Let us begin! First, we need to import the modules we will be using:

from pyss3 import SS3
from pyss3.util import Dataset

Then, before moving any further, we will unzip the training data. Since it is located in the same directory as this notebook file (extract_insight.ipynb), we can simply use the following shell command:

!unzip -u datasets/ -d datasets/

Let’s create a new instance of the SS3 classifier. We’re going to use the same dataset that is used in the Topic Categorization tutorial. This dataset was created by collecting tweets with hashtags related to these 8 different categories: “art&photography”, “beauty&fashion”, “business&finance”, “food”, “health”, “music”, “science&technology” and “sports”.

# [create a new instance of the SS3 classifier]
# Just ignore those hyperparameter values (s=0.32, l=1.24, p=1.1)
# they were obtained from the tutorial (after performing hyperparameter optimization)
# We could have simply used the default values with
# clf = SS3()
# but classification results would have been suboptimal (not optimized)
clf = SS3(s=0.32, l=1.24, p=1.1)

# The following lines could be replaced with just a single "clf.load_model()" in case we have
# previously saved the model elsewhere (using "clf.save_model()"), but since this notebook
# is meant to be run from anywhere, we will train our model from scratch:

# Let's load the training set
x_train, y_train = Dataset.load_from_files("datasets/topic/train", folder_label=False)

clf.train(x_train, y_train, n_grams=3)
Training: 100%|██████████| 8/8 [00:36<00:00,  4.57s/it]

We will use the following example document for SS3 to give us the text parts involved in classifying it:

Effects of intensive short-term dynamic psychotherapy on social cognition in major depression

Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality of life.

However, there are few studies of treatment.

Objective: To investigate the efficacy of intensive short-term dynamic psychotherapy on social cognition in major depression.

Method: This study used a parallel randomized control group design to compare pre-test and post-test social cognition scores between depressed participants receiving ISTDP and those allocated to a wait-list control group. Participants were adults (19–40 years of age) who were diagnosed with depression. We recruited 32 individuals, with 16 participants allocated to the ISTDP and control groups, respectively. Both groups were similar in terms of age, sex and educational level.

Results: Multivariate analysis of variance (MANOVA) demonstrated that the intervention was effective in terms of the total score of social cognition: the experimental group had a significant increase in the post-test compared to the control group. In addition, the experimental group showed a significant reduction in the negative subjective score compared to the control group as well as an improvement in response to positive neutral and negative states. Conclusion: Depressed patients receiving ISTDP show a significant improvement in social cognition post treatment compared to a wait-list control group.

We will assign it to the document variable:

document = """
Effects of intensive short-term dynamic psychotherapy on social cognition in major depression

Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality of life. However, there are few studies of treatment.
Objective: To investigate the efficacy of intensive short-term dynamic psychotherapy on social cognition in major depression.
Method: This study used a parallel randomized control group design to compare pre-test and post-test social cognition scores between depressed participants receiving ISTDP and those allocated to a wait-list control group. Participants were adults (19–40 years of age) who were diagnosed with depression. We recruited 32 individuals, with 16 participants allocated to the ISTDP and control groups, respectively. Both groups were similar in terms of age, sex and educational level.
Results: Multivariate analysis of variance (MANOVA) demonstrated that the intervention was effective in terms of the total score of social cognition: the experimental group had a significant increase in the post-test compared to the control group. In addition, the experimental group showed a significant reduction in the negative subjective score compared to the control group as well as an improvement in response to positive neutral and negative states.
Conclusion: Depressed patients receiving ISTDP show a significant improvement in social cognition post treatment compared to a wait-list control group.
"""

Now, before we ask SS3 to extract those relevant fragments used for classifying this document, we will ask SS3 to classify it.


Among the 8 learned category labels, SS3 decided to assign the label 'health' to it, which we, as humans, can tell is the correct decision.
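The call that produced this label is not shown above. A minimal sketch, assuming it is the classify_label method that this notebook refers to later on; the helper name predict_label is made up for illustration:

```python
def predict_label(clf, doc):
    """Return the single category label that `clf` assigns to `doc`.

    Assumes `clf` is a trained classifier exposing `classify_label`,
    the method this notebook mentions in a later section.
    """
    return clf.classify_label(doc)
```

With the classifier trained above, predict_label(clf, document) should give the 'health' label discussed here.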

Now we are ready to ask SS3 to extract the relevant fragments for us. To do this, we will use the clf.extract_insight() method. This new method, given a document, returns the pieces of text that were involved in the classification decision, along with the confidence value associated with each (its documentation is available here).

fragments = clf.extract_insight(document)

print("How many text fragments were extracted?", len(fragments))
How many text fragments were extracted? 17

Let’s see what the first fragment looks like…

('Effects of intensive short-term dynamic psychotherapy on social cognition in major depression', 0.6793249876085043)

As we can see, each returned fragment is a pair of the form (text fragment, confidence value). Therefore, if we want only the text, we can select only the first component:

print("Text:", fragments[0][0])
print("Confidence value:", fragments[0][1])
Text: Effects of intensive short-term dynamic psychotherapy on social cognition in major depression

Confidence value: 0.6793249876085043
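Since each fragment is a plain (text, confidence) tuple, it can also be unpacked directly in a loop instead of being indexed. A small sketch on a hand-copied sample: sample_fragments is only a stand-in for the real list and reuses three pairs from this notebook's full fragment list.

```python
# Stand-in for the list returned by clf.extract_insight(document);
# the three pairs are copied from this notebook's output.
sample_fragments = [
    ("there are few studies of treatment.", 0.28538479404883194),
    ("Method: This study used a parallel randomized ", 0.2600748912571276),
    ("in response to positive neutral and negative states.", 0.24862272509122232),
]

# Tuple unpacking gives each component a name instead of an index.
for text, confidence in sample_fragments:
    print(f"{confidence:.3f}  {text.strip()}")

# Keeping only the texts (or only the confidence values) is a one-liner.
texts = [text for text, _ in sample_fragments]
confidences = [confidence for _, confidence in sample_fragments]
```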

Now, let’s take a look at the entire fragments list:

[('Effects of intensive short-term dynamic psychotherapy on social cognition in major depression',
 ('Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality of life. However, there are few ',
 ('age) who were diagnosed with depression. We recruited 32 individuals, with 16 participants allocated to the ISTDP ',
 ('of variance (MANOVA) demonstrated that the intervention was effective in terms of the total score of social cognition: the experimental group had ',
 ('Objective: To investigate the efficacy of intensive short-term dynamic psychotherapy on social cognition in major depression.',
 ('group showed a significant reduction in the negative subjective score compared to the control group as ',
 ('group had a significant increase in the post-test compared to the control group',
 ('there are few studies of treatment.', 0.28538479404883194),
 ('Method: This study used a parallel randomized ', 0.2600748912571276),
 ('in response to positive neutral and negative states.', 0.24862272509122232),
 ('improvement in social cognition post treatment compared to a wait-list control group',
 ('Conclusion: Depressed patients receiving ISTDP show a significant improvement in social ',
 (' Participants were adults (19–40 years of ', 0.11733643643026903),
 ('post-test social cognition scores between depressed participants receiving ISTDP ',
 ('ISTDP and those allocated to a wait', 0.030070886155898154),
 ('Both groups were similar in terms of age, sex and ', 0.025867692840869892),
 ('group design to compare pre-test and ', 0.018493850304321317)]

As we can see, fragments are returned in a list ordered by confidence value, which is great: the further away a fragment is from the first one, the less confident SS3 is that it is relevant to the assigned category. This is really desirable because “real-life” documents can be arbitrarily long, and we can always keep just the top n elements. For example, let’s select the top 3 elements:

[('Effects of intensive short-term dynamic psychotherapy on social cognition in major depression',
 ('Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality of life. However, there are few ',
 ('age) who were diagnosed with depression. We recruited 32 individuals, with 16 participants allocated to the ISTDP ',
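Because the list is already sorted by confidence, the top-3 selection is an ordinary list slice. A minimal sketch of that, plus a confidence cutoff as an alternative way to trim the list. The helper names top_n and at_least and the 0.25 cutoff are illustrative choices, and the sample pairs are copied from the full list shown earlier:

```python
def top_n(fragments, n=3):
    """Return the first n (text, confidence) pairs of an already-sorted list."""
    return fragments[:n]


def at_least(fragments, min_confidence):
    """Keep only the pairs whose confidence reaches min_confidence."""
    return [(text, cv) for text, cv in fragments if cv >= min_confidence]


# Stand-in for clf.extract_insight(document): four pairs from the output above.
sample_fragments = [
    ("there are few studies of treatment.", 0.28538479404883194),
    ("Method: This study used a parallel randomized ", 0.2600748912571276),
    ("in response to positive neutral and negative states.", 0.24862272509122232),
    ("group design to compare pre-test and ", 0.018493850304321317),
]

print(top_n(sample_fragments, 3))
print(at_least(sample_fragments, 0.25))
```

With the real list, fragments[:3] is all that is needed. A cutoff can be handy when the number of useful fragments varies from document to document.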

And that’s all! Or is it? Want to go a little bit deeper? The following sections show some more advanced features of the extract_insight method, just in case some of them are useful to you.

What about the other categories?

SS3 provides a version of the clf.classify_label(long_document) method for multi-label classification, called classify_multilabel. So let’s ask SS3 to classify the document again, but this time without the “select-only-one-category” constraint imposed by classify_label.

['health', 'science&technology']

Among the 8 learned category labels, this time SS3 decided to assign not only the 'health' label but also 'science&technology'. As humans, we can again tell that both are correct, since the document is clearly a scientific article related to health.
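The call that produced the label list above is not shown. A minimal sketch, assuming it is classify_multilabel applied to the same document; the helper name compare_single_and_multi is hypothetical and simply puts both results side by side:

```python
def compare_single_and_multi(clf, doc):
    """Return (single_label, multi_labels, extra_labels) for `doc`.

    Assumes `clf` is a trained SS3 instance: `classify_label` picks exactly
    one category, while `classify_multilabel` may return several.
    """
    single_label = clf.classify_label(doc)
    multi_labels = clf.classify_multilabel(doc)
    # Labels that only show up once the one-category constraint is lifted.
    extra_labels = [label for label in multi_labels if label != single_label]
    return single_label, multi_labels, extra_labels
```

For the example document, the extra labels would be expected to contain only 'science&technology', given the results reported in this notebook.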

The problem is that if we use extract_insight again in the same way, it will show us the same result, that is, the fragments related to 'health' (the category assigned when only one can be selected). How do we tell SS3 that we want to extract fragments related to other categories? By using the cat argument!

For instance, if we want SS3 to give us the text fragments that were used for classifying the document as science&technology, we can do as follows:

fragments = clf.extract_insight(document, cat="science&technology")

[('Method: This study used a parallel randomized control group design to compare pre-test and post',
 ('Objective: To investigate the efficacy of intensive short-term dynamic psychotherapy on social cognition in major depression.',
 ('Conclusion: Depressed patients receiving ISTDP show a significant improvement in social cognition post treatment compared to a wait-list control group.',

We can see that, unlike the previous ones, these fragments focus less on health-related aspects and much more on scientific ones. SS3 even gave us the well-known Method, Objective and Conclusion sections of research papers. For instance, if we read the first fragment without any context, “Method: This study used a parallel randomized control group design to compare pre-test and post”, we, as humans, can clearly see it is related to science.

Just for fun, let’s force SS3 to extract the text fragments that it would use to classify the document, in a parallel universe, as sports-ish.

fragments = clf.extract_insight(document, cat="sports")

[('the negative subjective score compared to the control group as ',
 ('of the total score of social cognition: ', 0.06487662686978977),
 ('-test social cognition scores between depressed participants ',

We can see a pattern here, namely, fragments are talking about scores, which again is the logical answer.
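The same cat argument can be looped over to compare several categories at once. A rough sketch, assuming a trained clf; the helper name best_fragment_per_category is made up here, and the category names are passed in explicitly rather than queried from the classifier:

```python
def best_fragment_per_category(clf, doc, categories):
    """Map each category to its highest-confidence (text, confidence) pair.

    Categories for which no fragment is returned map to None.
    """
    best = {}
    for category in categories:
        fragments = clf.extract_insight(doc, cat=category)
        # The list comes back sorted by confidence, so index 0 is the best one.
        best[category] = fragments[0] if fragments else None
    return best
```

For example, best_fragment_per_category(clf, document, ["health", "science&technology", "sports"]) would collect the first fragment of each of the three lists shown in this section.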

How to control the size of the fragments?

TL;DR: Use the window_size argument!

If not given, window_size defaults to 3. Bigger values produce longer fragments, while smaller ones produce, you guessed it, shorter ones. Let’s try out some values.

fragments = clf.extract_insight(document, window_size=0)

[('Effects of ', 0.34410723095944096),
 ('total ', 0.32683582484809587),
 ('psychiatric ', 0.2860576039598297)]
fragments = clf.extract_insight(document, window_size=1)

[('were diagnosed with depression. We ', 0.47386514201385327),
 ('Effects of intensive short', 0.3881150202849344),
 ('the total score ', 0.3268857739319143)]
fragments = clf.extract_insight(document, window_size=2)

[('Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality ',
 ('who were diagnosed with depression. We recruited 32 individuals, with ',
 ('Effects of intensive short-term dynamic psychotherapy on ',
fragments = clf.extract_insight(document, window_size=5)

[('Multivariate analysis of variance (MANOVA) demonstrated that the intervention was effective in terms of the total score of social cognition: the experimental group had a significant increase in the post-test compared to the control group. In addition, the experimental group showed a significant reduction in the negative subjective score compared to the control group as well as an improvement in response to positive neutral and negative states.',
 ('Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality of life. However, there are few studies of treatment.',
 ('Effects of intensive short-term dynamic psychotherapy on social cognition in major depression',
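To compare several window_size values side by side rather than reading the raw lists, one option is to summarize each run. A sketch under the assumption that clf is the trained classifier; fragment_stats_by_window is a hypothetical helper and whitespace-separated word counts are only a rough measure of fragment length:

```python
def fragment_stats_by_window(clf, doc, window_sizes=(0, 1, 2, 3, 5)):
    """Return {window_size: (fragment_count, average_words_per_fragment)}."""
    stats = {}
    for size in window_sizes:
        fragments = clf.extract_insight(doc, window_size=size)
        # Rough length measure: whitespace-separated tokens per fragment.
        word_counts = [len(text.split()) for text, _ in fragments]
        average = sum(word_counts) / len(word_counts) if word_counts else 0.0
        stats[size] = (len(fragments), average)
    return stats
```

Calling fragment_stats_by_window(clf, document) should show the average fragment length growing with the window size, in line with the outputs above.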

Nice, it works like a charm! But… what if I want each fragment to be exactly one paragraph? Or… one sentence? Instead of window_size, use the level argument! This argument takes one of 3 possible values: 'paragraph', 'sentence', or the default 'word', which is used when the level argument is not given. It tells SS3 the “level” at which fragments are to be constructed.

For instance, let’s ask SS3 to give us the most relevant paragraph that was used for classifying the document as scientific:

fragments = clf.extract_insight(document, cat="science&technology", level="paragraph")

print("The most cool paragraph is:\n\n", fragments[0][0])
print("And its confidence value:", fragments[0][1])
The most cool paragraph is:

 Method: This study used a parallel randomized control group design to compare pre-test and post-test social cognition scores between depressed participants receiving ISTDP and those allocated to a wait-list control group. Participants were adults (19–40 years of age) who were diagnosed with depression. We recruited 32 individuals, with 16 participants allocated to the ISTDP and control groups, respectively. Both groups were similar in terms of age, sex and educational level.

And its confidence value: 1.4044308397641223

And what about the 3 most relevant sentences to 'health'?

fragments = clf.extract_insight(document, level="sentence")

[('Results: Multivariate analysis of variance (MANOVA) demonstrated that the intervention was effective in terms of the total score of social cognition: the experimental group had a significant increase in the post-test compared to the control group',
 ('Effects of intensive short-term dynamic psychotherapy on social cognition in major depression',
 ('Background: Social cognition is commonly affected in psychiatric disorders and is a determinant of quality of life',
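The slicing step that keeps only the first three sentences is not shown above. A small sketch combining level with the top-n idea; most_relevant is a hypothetical helper, and the cat argument is only forwarded when given so that SS3 otherwise falls back to its own default:

```python
def most_relevant(clf, doc, level="sentence", top=3, cat=None):
    """Return the `top` highest-confidence fragments built at `level`.

    `level` is one of 'paragraph', 'sentence' or 'word'. When `cat` is None,
    the argument is left out, so the classifier's default behaviour applies.
    """
    kwargs = {"level": level}
    if cat is not None:
        kwargs["cat"] = cat
    # The returned list is sorted by confidence, so a slice keeps the best ones.
    return clf.extract_insight(doc, **kwargs)[:top]
```

For instance, most_relevant(clf, document) corresponds to the three sentences listed above, and most_relevant(clf, document, level="paragraph", top=1, cat="science&technology") to the paragraph example before it.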

Cool! However, what if I want to redefine what SS3 considers a paragraph, a sentence or a word?… well, what? OK… I guess you're working with a different type of text, that is, text that for some reason has a special format.

Let’s now suppose we are working with “weird” documents in which the biggest blocks are delimited by the @ character (as if they were paragraphs), and these “@-paragraph” blocks are, in turn, composed of smaller blocks delimited by the # character (as if they were sentences). Let’s also suppose that we want to analyze the following document:

weird_document="@Effects of#intensive short-term dynamic psychotherapy@on social cognition#in major depression@"

As we can see, this weird document has two “@-paragraphs” with two “#-sentences” each. If we use the extract_insight method as before, it will return only a single fragment, since SS3 sees this weird document as a “normal” one, that is, a document with a single paragraph containing a single sentence:

fragments = clf.extract_insight(weird_document, level="sentence")

[('@Effects of#intensive short-term dynamic psychotherapy@on social cognition#in major depression@',

Therefore, we need to tell SS3 that we want to redefine these concepts so that “he” can be aware of those “@-paragraphs” and “#-sentences”. We can do this by using the set_block_delimiters method (documentation here), as follows:

clf.set_block_delimiters(parag="@", sent="#")

Now, let’s try again…

fragments = clf.extract_insight(weird_document, level="sentence")

[('Effects of', 0.34410723095944096),
 ('in major depression', 0.2021045058091867),
 ('intensive short-term dynamic psychotherapy', 0.10779387505953043),
 ('on social cognition', 0.025319375780346178)]

Perfect! This time, all four “#-sentences” got caught :)
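To make the block structure explicit, here is a rough illustration of the segmentation those two delimiters describe. It uses plain re.split and is not PySS3's actual implementation; split_blocks is a made-up helper, and the delimiters are treated as regular expressions, as the escaped "\." used later in this notebook suggests:

```python
import re

weird_document = (
    "@Effects of#intensive short-term dynamic psychotherapy"
    "@on social cognition#in major depression@"
)


def split_blocks(text, parag="@", sent="#"):
    """Split text into paragraphs, and each paragraph into sentences.

    Delimiters are treated as regular expressions; empty pieces are dropped.
    """
    paragraphs = [p for p in re.split(parag, text) if p]
    return [[s for s in re.split(sent, p) if s] for p in paragraphs]


blocks = split_blocks(weird_document)
for number, sentences in enumerate(blocks, start=1):
    print(f"@-paragraph {number}: {sentences}")
```

The result has two inner lists of two strings each, matching the four “#-sentences” returned above.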

Let’s see what happens with the @-paragraphs:

fragments = clf.extract_insight(weird_document, level="paragraph")

# ignore this line, just restoring the default delimiter values
# just in case you want to re-run some of the code given previously
# with the "normal document" (not the @weirdo# one)
clf.set_block_delimiters(parag="\n", sent=r"\.")

[('Effects of#intensive short-term dynamic psychotherapy', 0.4519011060189714),
 ('on social cognition#in major depression', 0.2274238815895329)]

As expected, it worked like a charm :D …. but… what if.. just jokin’ no more buts (for now).
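Since the delimiters stay changed until they are set back, the restore step used above is easy to forget. One way to make it automatic is a small context manager. This is only a sketch: temporary_delimiters is a hypothetical helper built on set_block_delimiters, and the values it restores are the ones this notebook uses as defaults, not something queried from the classifier.

```python
from contextlib import contextmanager


@contextmanager
def temporary_delimiters(clf, parag, sent, default_parag="\n", default_sent=r"\."):
    """Use custom block delimiters inside a `with` block, then restore defaults.

    The default values are the ones this notebook uses when restoring them.
    """
    clf.set_block_delimiters(parag=parag, sent=sent)
    try:
        yield clf
    finally:
        # Runs even if the body raises, so the classifier is never left
        # with the custom delimiters by accident.
        clf.set_block_delimiters(parag=default_parag, sent=default_sent)
```

It would be used as with temporary_delimiters(clf, parag="@", sent="#"): followed by the extract_insight call on weird_document inside the block.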

Just remember that all these last sections addressed more “advanced” cases. Most of the time you should be just fine with plain clf.extract_insight(document), simply using different window_size values when needed.

BTW, wow! you've reached this far! you deserve a nice coffee, don't you? Have an awesome day.