PySS3 Package
Main module
This is the main module containing the implementation of the SS3 classifier.
(Please, visit https://github.com/sergioburdisso/pyss3 for more info)
- exception pyss3.EmptyModelError(msg='')
Bases: Exception
Exception to be thrown when the model is empty.
- exception pyss3.InvalidCategoryError(msg='')
Bases: Exception
Exception to be thrown when a category is not valid.
- class pyss3.SS3(s=None, l=None, p=None, a=None, name='', cv_m='norm_gv_xai', sg_m='xai')
Bases: object
The SS3 classifier class.
The SS3 classifier was originally defined in Section 3 of https://dx.doi.org/10.1016/j.eswa.2019.05.023 (preprint available here: https://arxiv.org/abs/1905.08772)
- Parameters:
s (float) – the “smoothness” (sigma) hyperparameter value
l (float) – the “significance” (lambda) hyperparameter value
p (float) – the “sanction” (rho) hyperparameter value
a (float) – the alpha hyperparameter value (i.e. all terms with a confidence value (cv) less than alpha will be ignored during classification)
name (str) – the model’s name (to save and load the model from disk)
cv_m (str) – method used to compute the confidence value (cv) of each term (word or n-grams), options are: “norm_gv_xai”, “norm_gv” and “gv” (default: “norm_gv_xai”)
sg_m (str) – method used to compute the significance (sg) function, options are: “vanilla” and “xai” (default: “xai”)
- classify(doc, prep=True, sort=True, json=False, prep_func=None)
Classify a given document.
- Parameters:
doc (str) – the content of the document
prep (bool) – enables the default input preprocessing (default: True)
sort (bool) – sort the classification result (from best to worst)
json (bool) – return a debugging version of the result in JSON format.
prep_func (function) – the custom preprocessing function to be applied to the given document before classifying it. If not given, the default preprocessing function will be used (as long as prep=True)
- Returns:
the document confidence vector if sort is False. If sort is True, a list of pairs (category index, confidence value) ordered by confidence value.
- Return type:
list
- Raises:
EmptyModelError
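The two return shapes of classify can be pictured with a small sketch. It does not call PySS3: the helper name sort_confidence_vector and the example vector are hypothetical, and it assumes that position i of the confidence vector corresponds to category index i.

```python
def sort_confidence_vector(cv_vector):
    """Turn a confidence vector into (category index, confidence value) pairs, best first."""
    pairs = list(enumerate(cv_vector))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical confidence vector for three categories (indexes 0, 1 and 2).
confidence_vector = [0.2, 1.7, 0.0]
ranked = sort_confidence_vector(confidence_vector)
print(ranked)
```

With sort=False the caller would see something shaped like confidence_vector; with sort=True, something shaped like ranked.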
- classify_label(doc, def_cat='most-probable', labels=True, prep=True)
Classify a given document returning the category label.
- Parameters:
doc (str) – the content of the document
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”)
labels (bool) – whether to return the category label or just the category index (default: True)
prep (bool) – enables the default input preprocessing process (default: True)
- Returns:
the category label or the category index.
- Return type:
str or int
- Raises:
InvalidCategoryError
- classify_multilabel(doc, def_cat='unknown', labels=True, prep=True)
Classify a given document returning multiple category labels.
This method could be used to perform multi-label classification. Internally, it uses k-means clustering on the confidence vector to select the proper group of labels.
- Parameters:
doc (str) – the content of the document
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “unknown”)
labels (bool) – whether to return the category labels or just the category indexes (default: True)
prep (bool) – enables the default input preprocessing (default: True)
- Returns:
the list of category labels (or indexes).
- Return type:
list (of str or int)
- Raises:
InvalidCategoryError
- cv(ngram, cat)
Return the “confidence value” of a given word n-gram for the given category.
This value is obtained by applying a final transformation to the global value of the given word n-gram, as computed by the gv function [*].
This transformation is selected when creating a new SS3 instance (see the cv_m argument of the SS3 class constructor for more information).
[*] the gv function is defined in Section 3.2.2 of the original paper.
Examples:
>>> clf.cv("chicken", "food")
>>> clf.cv("roast chicken", "food")
>>> clf.cv("chicken", "sports")
- Parameters:
ngram (str) – the word or word n-gram
cat (str) – the category label
- Returns:
the confidence value
- Return type:
float
- Raises:
InvalidCategoryError
- extract_insight(doc, cat='auto', level='word', window_size=3, min_cv=0.01, sort=True)
Get the list of text blocks involved in the classification decision.
Given a document, return the pieces of text that were involved in the classification decision, along with the confidence values associated with them. If a category is given, perform the process as if the given category were the one assigned by the classifier.
- Parameters:
doc (str) – the content of the document
cat (str) – the category in relation to which text blocks are obtained. If not present, it will automatically use the category assigned by SS3 after classification. Options are ‘auto’ or a given category name. (default: ‘auto’)
level (str) – the level at which text blocks are going to be extracted. Options are ‘word’, ‘sentence’ or ‘paragraph’. (default: ‘word’)
window_size (int) – the number of words, before and after each identified word, to also be included along with the identified word. For instance, window_size=0 means return only individual words, window_size=1 means also include the word before and the word after each of them. If multiple selected words are close enough for their word windows to overlap, those windows will be merged into a single, longer one. This argument is ignored when level is not equal to ‘word’. (default: 3)
min_cv (float) – the minimum confidence value each text block must have to be included in the output. (default: 0.01)
sort (bool) – whether to return the text blocks ordered by their confidence value or not. If sort=False then blocks will be returned following the order they had in the input document. (default: True)
- Returns:
a list of pairs (text, confidence value) containing the text (blocks) involved, and to what degree (*), in the classification decision. (*) given by the confidence value
- Return type:
list
- Raises:
InvalidCategoryError, ValueError
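The window_size behaviour described above (expand each selected word, then merge windows that overlap) can be sketched in plain Python. This is not PySS3's implementation: the helper merge_word_windows, the sample sentence and the selected positions are all hypothetical, and whether merely adjacent windows are also merged by the library is not stated in this documentation.

```python
def merge_word_windows(selected, window_size, n_words):
    """Expand each selected word index by window_size words per side and merge overlaps."""
    windows = []
    for idx in sorted(selected):
        start = max(0, idx - window_size)
        end = min(n_words - 1, idx + window_size)
        if windows and start <= windows[-1][1]:
            # This window overlaps the previous one: extend the previous window.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [tuple(w) for w in windows]


words = "the chef served roast chicken with a light lemon sauce today".split()
# Suppose the words at positions 4 ("chicken") and 9 ("sauce") were selected.
for size in (0, 1, 3):
    spans = merge_word_windows([4, 9], size, len(words))
    print(size, [" ".join(words[a:b + 1]) for a, b in spans])
```

With window_size=3 the two windows overlap, so a single longer block is produced instead of two.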
- fit(x_train, y_train, n_grams=1, prep=True, leave_pbar=True)
Train the model given a list of documents and category labels.
- Parameters:
x_train (list (of str)) – the list of documents
y_train (list of str for singlelabel classification; list of list of str for multilabel classification.) – the list of document labels
n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
prep (bool) – enables the default input preprocessing (default: True)
leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.
- Raises:
ValueError
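To make the meaning of n_grams concrete, the sketch below lists the word n-grams that fall under a given maximum length. It is only an illustration in plain Python: word_ngrams is a hypothetical helper, not a PySS3 function, and it says nothing about how the library stores or counts the n-grams it learns.

```python
def word_ngrams(words, max_n):
    """Return every word n-gram of length 1..max_n, grouped by n-gram length."""
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams


words = ["roast", "chicken", "recipe"]
print(word_ngrams(words, 1))  # 1-grams only
print(word_ngrams(words, 2))  # 1-grams and 2-grams
print(word_ngrams(words, 3))  # 1-grams, 2-grams and 3-grams
```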
- get_a()
Get the alpha hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_alpha()
Get the alpha hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_categories(all=False)
Get the list of category names.
- Returns:
the list of category names
- Return type:
list (of str)
- get_category_index(name)
Given its name, return the category index.
- Parameters:
name (str) – The category name
- Returns:
the category index (or IDX_UNKNOWN_CATEGORY if the category doesn’t exist).
- Return type:
int
- get_category_name(index)
Given its index, return the category name.
- Parameters:
index (int) – The category index
- Returns:
the category name (or STR_UNKNOWN_CATEGORY if the category doesn’t exist).
- Return type:
str
- get_hyperparameters()
Get hyperparameter values.
- Returns:
a tuple with the current hyperparameter values (s, l, p, a)
- Return type:
tuple
- get_l()
Get the “significance” (lambda) hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_most_probable_category()
Get the name of the most probable category.
- Returns:
the name of the most probable category
- Return type:
str
- get_name()
Return the model’s name.
- Returns:
the model’s name.
- Return type:
str
- get_next_words(sent, cat, n=None)
Given a sentence, return the list of n (possible) following words.
- Parameters:
sent (str) – a sentence (e.g. “an artificial”)
cat (str) – the category name
n (int) – the maximum number of possible answers
- Returns:
a list of tuples (word, frequency, probability)
- Return type:
list (of tuple)
- Raises:
InvalidCategoryError
- get_ngrams_length()
Return the length of the longest learned n-gram.
- Returns:
the length of the longest learned n-gram.
- Return type:
int
- get_p()
Get the “sanction” (rho) hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_s()
Get the “smoothness” (sigma) hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_sanction()
Get the “sanction” (rho) hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_significance()
Get the “significance” (lambda) hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_smoothness()
Get the “smoothness” (sigma) hyperparameter value.
- Returns:
the hyperparameter value
- Return type:
float
- get_stopwords(sg_threshold=0.01)
Get the list of (recognized) stopwords.
- Parameters:
sg_threshold (float) – significance (sg) value used as a threshold to consider words as stopwords (i.e. words with sg < sg_threshold for all categories will be considered as “stopwords”)
- Returns:
a list of stopwords
- Return type:
list (of str)
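The threshold rule stated for sg_threshold (a word counts as a stopword when its sg value is below the threshold for every category) can be sketched as follows. The function name, the dictionary layout and the sg values are hypothetical; PySS3 computes sg from the trained model rather than from a table like this.

```python
def stopwords_below_threshold(sg_by_word, sg_threshold=0.01):
    """Return the words whose sg value is below the threshold in every category."""
    return [
        word
        for word, sg_values in sg_by_word.items()
        if all(sg < sg_threshold for sg in sg_values)
    ]


# Hypothetical sg values per word, one value per category.
sg_by_word = {
    "the": [0.001, 0.002],
    "chicken": [0.9, 0.003],  # significant for the first category, so it is kept
    "of": [0.0, 0.004],
}
print(stopwords_below_threshold(sg_by_word))
```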
- get_word(index)
Given the index, return the word.
- Parameters:
index (int) – the word index
- Returns:
the word (or STR_UNKNOWN_WORD if the word doesn’t exist).
- Return type:
str
- get_word_index(word)
Given a word, return its index.
- Parameters:
word (str) – a word
- Returns:
the word index (or
IDX_UNKNOWN_WORDif the word doesn’t exist).- Return type:
int
- gv(ngram, cat)
Return the “global value” of a given word n-gram for the given category.
(The gv function is defined in Section 3.2.2 of the original paper.)
Examples:
>>> clf.gv("chicken", "food")
>>> clf.gv("roast chicken", "food")
>>> clf.gv("chicken", "sports")
- Parameters:
ngram (str) – the word or word n-gram
cat (str) – the category label
- Returns:
the global value
- Return type:
float
- Raises:
InvalidCategoryError
- learn(doc, cat, n_grams=1, prep=True, update=True)
Learn a new document for a given category.
- Parameters:
doc (str) – the content of the document
cat (str) – the category name
n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
prep (bool) – enables the default input preprocessing (default: True)
update (bool) – enables model auto-update after learning (default: True)
- load(path=None)
Load model from disk.
If a path is not given, the default will be used (“./”). However, if a path is given, it will not only be used to load the model but will also overwrite the default path by calling the SS3’s set_model_path(path) method (see the set_model_path method documentation for more detail).
- Parameters:
path (str) – the path to load the model from
- Raises:
OSError
- load_model(path=None)
Load model from disk.
If a path is not given, the default will be used (“./”). However, if a path is given, it will not only be used to load the model but will also overwrite the default path by calling the SS3’s set_model_path(path) method (see the set_model_path method documentation for more detail).
- Parameters:
path (str) – the path to load the model from
- Raises:
OSError
- lv(ngram, cat)
Return the “local value” of a given word n-gram for the given category.
(The lv function is defined in Section 3.2.2 of the original paper.)
Examples:
>>> clf.lv("chicken", "food")
>>> clf.lv("roast chicken", "food")
>>> clf.lv("chicken", "sports")
- Parameters:
ngram (str) – the word or word n-gram
cat (str) – the category label
- Returns:
the local value
- Return type:
float
- Raises:
InvalidCategoryError
- plot_value_distribution(cat)
Plot the category’s global and local value distribution.
- Parameters:
cat (str) – the category name
- Raises:
InvalidCategoryError
- predict(x_test, def_cat=None, labels=True, multilabel=False, prep=True, leave_pbar=True)
Classify a list of documents.
- Parameters:
x_test (list (of str)) – the list of documents to be classified
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”, or “unknown” for multi-label classification)
labels (bool) – whether to return the list of category names or just category indexes
multilabel (bool) – whether to perform multi-label classification or not. If enabled, for each document returns a list of labels instead of a single label (str). If the model was trained using multilabeled data, then this argument will be ignored and set to True.
prep (bool) – enables the default input preprocessing (default: True)
leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.
- Returns:
if labels is True, the list of category names, otherwise, the list of category indexes.
- Return type:
list (of int or str)
- Raises:
EmptyModelError, InvalidCategoryError
- predict_proba(x_test, prep=True, leave_pbar=True)
Classify a list of documents returning a list of confidence vectors.
- Parameters:
x_test (list (of str)) – the list of documents to be classified
prep (bool) – enables the default input preprocessing (default: True)
leave_pbar (bool) – controls whether to leave the progress bar after finishing or remove it.
- Returns:
the list of confidence vectors
- Return type:
list (of list of float)
- Raises:
EmptyModelError
- print_categories_info()
Print information about learned categories.
- print_hyperparameters_info()
Print information about hyperparameters.
- print_model_info()
Print information regarding the model.
- print_ngram_info(ngram)
Print debugging information about a given n-gram.
Namely, print the n-gram frequency (fr), local value (lv), global value (gv), confidence value (cv), sanction (sn) weight, significance (sg) weight.
- Parameters:
ngram (str) – the n-gram (e.g. “machine”, “machine learning”, etc.)
- save(path=None)
Save the model to disk.
If a path is not given, the default will be used (“./”). However, if a path is given, it will not only be used to save the model but will also overwrite the default path by calling the SS3’s set_model_path(path) method (see the set_model_path method documentation for more detail).
- Parameters:
path (str) – the path to save the model to
- Raises:
OSError
- save_cat_vocab(cat, path='./', n_grams=-1)
Save category vocabulary to disk.
- Parameters:
cat (str) – the category name
path (str) – the path in which to store the vocabulary
n_grams (int) – indicates the n-grams to be stored (e.g. only 1-grams, 2-grams, 3-grams, etc.). Default -1 stores all learned n-grams (1-grams, 2-grams, 3-grams, etc.)
- Raises:
InvalidCategoryError
- save_model(path=None)
Save the model to disk.
If a path is not given, the default will be used (“./”). However, if a path is given, it will not only be used to save the model but will also overwrite the default path by calling the SS3’s set_model_path(path) method (see the set_model_path method documentation for more detail).
- Parameters:
path (str) – the path to save the model to
- Raises:
OSError
- save_vocab(path='./', n_grams=-1)
Save learned vocabularies to disk.
- Parameters:
path (str) – the path in which to store the vocabularies
n_grams (int) – indicates the n-grams to be stored (e.g. only 1-grams, 2-grams, 3-grams, etc.). Default -1 stores all learned n-grams (1-grams, 2-grams, 3-grams, etc.)
- save_wordcloud(cat, top_n=100, n_grams=1, path=None, size=1024, shape='circle', palette='cartocolors.qualitative.Prism_2', color=None, background_color='white', plot=False)
Create a word cloud and save it to disk as an image.
The word cloud shows the top-n words selected by the confidence value learned by the model. In addition, individual words are sized by the learned value.
- Parameters:
cat (str) – the category label
top_n (int) – number of words to be taken into account. For instance, top 50 words (default: 100).
n_grams (int) – indicates the word n-grams to be used to create the cloud. For instance, 1 for word cloud, 2 for bigrams cloud, 3 for trigrams cloud, and so on (default: 1).
path (str) – the path to the image file in which to store the word cloud (e.g. “../../my_wordcloud.jpg”). If no path is given, by default, the image file will be stored in the current working directory as “wordcloud_topN_CAT(NGRAM).png” where N is the top_n value, CAT the category label and NGRAM indicates what n-grams populate the cloud.
size (int) – the size of the image in pixels (default: 1024)
shape (str) – the shape of the cloud (a FontAwesome icon name). The complete list of allowed icon names is available at https://fontawesome.com/v5.15/icons?d=gallery&p=1&m=free (default: “circle”)
palette (str) – the color palette used for coloring words by giving the palettable module and palette name (list available at https://jiffyclub.github.io/palettable/) (default: “cartocolors.qualitative.Prism_2”)
color (str) – a custom color for words (if given, overrides the color palette). The color string could be the hex color code (e.g. “#FF5733”) or the HTML color name (e.g. “tomato”). The complete list of HTML color names is available at https://www.w3schools.com/colors/colors_names.asp
background_color (str) – the background color as either the HTML color name or the hex code (default: “white”).
plot (bool) – whether or not to also plot the cloud (after saving the file) (default: False)
- Raises:
InvalidCategoryError, ValueError
- set_a(value)
Set the alpha hyperparameter value.
All terms with a confidence value (cv) less than alpha will be ignored during classification.
- Parameters:
value (float) – the hyperparameter value
- set_alpha(value)
Set the alpha hyperparameter value.
All terms with a confidence value (cv) less than alpha will be ignored during classification.
- Parameters:
value (float) – the hyperparameter value
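The effect of alpha described above can be pictured as a simple filter over per-term confidence values. The sketch below is an assumption-labelled illustration only: apply_alpha and the cv values are hypothetical, and the real classifier applies the cut-off internally during classification.

```python
def apply_alpha(term_cvs, alpha):
    """Keep only the terms whose confidence value is not less than alpha."""
    return {term: cv for term, cv in term_cvs.items() if cv >= alpha}


# Hypothetical confidence values of three terms for one category.
term_cvs = {"chicken": 0.82, "with": 0.004, "sauce": 0.35}
print(apply_alpha(term_cvs, 0.0))   # nothing is ignored
print(apply_alpha(term_cvs, 0.01))  # "with" falls below alpha and is ignored
```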
- set_block_delimiters(parag=None, sent=None, word=None)
Overwrite the default delimiters used to split input documents into blocks.
Delimiters can be any regular expression, from simple ones (e.g. " ") to more complex ones (e.g. r"[^\s\w\d]").
Note: remember that regular expressions reserve certain characters, for example, the dot (.). To refer to the character itself and not its special meaning, escape it with a backslash (e.g. \.).
Examples:
>>> ss3.set_block_delimiters(word=r"\s")
>>> ss3.set_block_delimiters(word=r"\s", parag="\n\n")
>>> ss3.set_block_delimiters(parag="\n---\n")
>>> ss3.set_block_delimiters(sent=r"\.")
>>> ss3.set_block_delimiters(word=r"\|")
>>> ss3.set_block_delimiters(word=" ")
- Parameters:
parag (str) – the paragraph new delimiter
sent (str) – the sentence new delimiter
word (str) – the word new delimiter
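How such delimiter expressions behave can be checked with Python's standard re module before passing them to the classifier. The sketch below uses re.split directly and is not PySS3's internal splitting logic; the sample text is made up. Raw strings (r"...") are used for patterns containing backslashes, since recent Python versions warn about unrecognized escapes such as "\s" in ordinary string literals.

```python
import re

text = "First paragraph. It has two sentences.\n\nSecond paragraph."

# Paragraphs separated by a blank line.
paragraphs = re.split("\n\n", text)

# Sentences separated by a literal dot: the backslash escapes the dot,
# which would otherwise match any character.
sentences = [s.strip() for s in re.split(r"\.", paragraphs[0]) if s.strip()]

# Words separated by a single whitespace character.
words = re.split(r"\s", sentences[0])

print(paragraphs)
print(sentences)
print(words)
```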
- set_delimiter_paragraph(regex)
Set the delimiter used to split documents into paragraphs.
Remember that regular expressions reserve certain characters, for example, the dot (.). To refer to the character itself and not its special meaning, escape it with a backslash (e.g. \.).
- Parameters:
regex (str) – the regular expression of the new delimiter
- set_delimiter_sentence(regex)
Set the delimiter used to split documents into sentences.
Remember that regular expressions reserve certain characters, for example, the dot (.). To refer to the character itself and not its special meaning, escape it with a backslash (e.g. \.).
- Parameters:
regex (str) – the regular expression of the new delimiter
- set_delimiter_word(regex)
Set the delimiter used to split documents into words.
Remember that regular expressions reserve certain characters, for example, the dot (.). To refer to the character itself and not its special meaning, escape it with a backslash (e.g. \.).
- Parameters:
regex (str) – the regular expression of the new delimiter
- set_hyperparameters(s=None, l=None, p=None, a=None)
Set hyperparameter values.
- Parameters:
s (float) – the “smoothness” (sigma) hyperparameter
l (float) – the “significance” (lambda) hyperparameter
p (float) – the “sanction” (rho) hyperparameter
a (float) – the alpha hyperparameter (i.e. all terms with a confidence value (cv) less than alpha will be ignored during classification)
- set_l(value)
Set the “significance” (lambda) hyperparameter value.
- Parameters:
value (float) – the hyperparameter value
- set_model_path(path)
Overwrite the default path from which the model will be loaded (or saved to).
Note: be aware that the PySS3 Command Line tool looks for a local folder called ss3_models to load models. Therefore, the ss3_models folder will always be automatically appended to the given path (e.g. if path="my/path/", it will be converted into my/path/ss3_models).
- Parameters:
path (str) – the path
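The path conversion mentioned in the note can be reproduced with os.path.join. This is a sketch of the documented outcome only: resolve_model_folder is a hypothetical helper, and how PySS3 treats a path that already ends in ss3_models is not covered here.

```python
import os


def resolve_model_folder(path):
    """Append the "ss3_models" folder name to a user-supplied path."""
    return os.path.join(path, "ss3_models")


print(resolve_model_folder("my/path/"))
```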
- set_name(name)
Set the model’s name.
- Parameters:
name (str) – the model’s name.
- set_p(value)
Set the “sanction” (rho) hyperparameter value.
- Parameters:
value (float) – the hyperparameter value
- set_s(value)
Set the “smoothness” (sigma) hyperparameter value.
- Parameters:
value (float) – the hyperparameter value
- set_sanction(value)
Set the “sanction” (rho) hyperparameter value.
- Parameters:
value (float) – the hyperparameter value
- set_significance(value)
Set the “significance” (lambda) hyperparameter value.
- Parameters:
value (float) – the hyperparameter value
- set_smoothness(value)
Set the “smoothness” (sigma) hyperparameter value.
- Parameters:
value (float) – the hyperparameter value
- sg(ngram, cat)
Return the “significance factor” of a given word n-gram for the given category.
(The sg function is defined in Section 3.2.2 of the original paper.)
Examples:
>>> clf.sg("chicken", "food")
>>> clf.sg("roast chicken", "food")
>>> clf.sg("chicken", "sports")
- Parameters:
ngram (str) – the word or word n-gram
cat (str) – the category label
- Returns:
the significance factor
- Return type:
float
- Raises:
InvalidCategoryError
- sn(ngram, cat)
Return the “sanction factor” of a given word n-gram for the given category.
(The sn function is defined in Section 3.2.2 of the original paper.)
Examples:
>>> clf.sn("chicken", "food")
>>> clf.sn("roast chicken", "food")
>>> clf.sn("chicken", "sports")
- Parameters:
ngram (str) – the word or word n-gram
cat (str) – the category label
- Returns:
the sanction factor
- Return type:
float
- Raises:
InvalidCategoryError
- summary_op_ngrams(cvs)
Summary operator for n-gram confidence vectors.
By default it returns the sum of all confidence vectors. To use a custom summary operator, replace this function as shown in the following example:
>>> def my_summary_op(cvs):
>>>     return cvs[0]
>>> ...
>>> clf = SS3()
>>> ...
>>> clf.summary_op_ngrams = my_summary_op
Note that any function receiving a list of vectors and returning a single vector could be used. In the above example the summary operator is replaced by the user-defined my_summary_op, which returns only the confidence vector of the first n-gram and ignores all the others (a purely illustrative example that makes no real sense as an operator).
- Parameters:
cvs (list (of list of float)) – a list of n-gram confidence vectors
- Returns:
a sentence confidence vector
- Return type:
list (of float)
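The shape of a summary operator (a list of confidence vectors in, a single vector out) is easy to try in isolation. The sketch below shows an element-wise sum, which matches the documented default behaviour, next to a hypothetical element-wise maximum. Neither function is PySS3's own code and the sample vectors are made up.

```python
def sum_summary_op(cvs):
    """Add the confidence vectors element by element."""
    return [sum(values) for values in zip(*cvs)]


def max_summary_op(cvs):
    """Alternative operator: keep the largest value seen for each category."""
    return [max(values) for values in zip(*cvs)]


# Hypothetical confidence vectors of three n-grams over two categories.
cvs = [[0.5, 0.0], [0.25, 0.5], [0.0, 0.125]]
print(sum_summary_op(cvs))
print(max_summary_op(cvs))
```

A function with this shape could then be assigned in the way the example above shows, e.g. clf.summary_op_ngrams = max_summary_op.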
- summary_op_paragraphs(cvs)
Summary operator for paragraph confidence vectors.
By default it returns the sum of all confidence vectors. To use a custom summary operator, replace this function as shown in the following example:
>>> def dummy_summary_op(cvs):
>>>     return cvs[0]
>>> ...
>>> clf = SS3()
>>> ...
>>> clf.summary_op_paragraphs = dummy_summary_op
Note that any function receiving a list of vectors and returning a single vector could be used. In the above example the summary operator is replaced by the user-defined dummy_summary_op, which returns only the confidence vector of the first paragraph and ignores all the others (a purely illustrative example that makes no real sense as an operator).
- Parameters:
cvs (list (of list of float)) – a list of paragraph confidence vectors
- Returns:
the document confidence vector
- Return type:
list (of float)
- summary_op_sentences(cvs)
Summary operator for sentence confidence vectors.
By default it returns the sum of all confidence vectors. To use a custom summary operator, replace this function as shown in the following example:
>>> def dummy_summary_op(cvs):
>>>     return cvs[0]
>>> ...
>>> clf = SS3()
>>> ...
>>> clf.summary_op_sentences = dummy_summary_op
Note that any function receiving a list of vectors and returning a single vector could be used. In the above example the summary operator is replaced by the user-defined dummy_summary_op, which returns only the confidence vector of the first sentence and ignores all the others (a purely illustrative example that makes no real sense as an operator).
- Parameters:
cvs (list (of list of float)) – a list of sentence confidence vectors
- Returns:
a paragraph confidence vector
- Return type:
list (of float)
- train(x_train, y_train, n_grams=1, prep=True, leave_pbar=True)
Train the model given a list of documents and category labels.
- Parameters:
x_train (list (of str)) – the list of documents
y_train (list of str for singlelabel classification; list of list of str for multilabel classification.) – the list of document labels
n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
prep (bool) – enables the default input preprocessing (default: True)
leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.
- Raises:
ValueError
- update(force=False)
Update model values (cv, gv, lv, etc.).
- Parameters:
force (bool) – force update (even if hyperparameters haven’t changed)
- update_values(force=False)
Update model values (cv, gv, lv, etc.).
- Parameters:
force (bool) – force update (even if hyperparameters haven’t changed)
- class pyss3.SS3Vectorizer(clf, cat, ss3_weight='only_cat', tf_weight='raw_count', top_n=None, **kwargs)
Bases: CountVectorizer
Convert a collection of text documents to a document-term matrix weighted using an SS3 model.
The weight of a term t in a document d in relation to category c is calculated by multiplying a term frequency weight (tf_weight) with an SS3-based weight (ss3_weight), as follows:
- fit_transform(raw_documents)
Learn the vocabulary dictionary and return document-term matrix.
This is equivalent to fit followed by transform, but more efficiently implemented.
- Parameters:
raw_documents (iterable) – an iterable which generates either str, unicode or file objects.
y (None) – this parameter is ignored.
- Returns:
X (array of shape (n_samples, n_features)) – document-term matrix.
- transform(raw_documents)
Transform documents to document-term matrix.
Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
- Parameters:
raw_documents (iterable) – an iterable which generates either str, unicode or file objects.
- Returns:
X (sparse matrix of shape (n_samples, n_features)) – document-term matrix.
- pyss3.key_as_int(dct)
Cast the given dictionary (numerical) keys to int.
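A minimal sketch of this kind of key conversion is shown below. The helper keys_to_int is hypothetical and is not the library's implementation. The JSON round trip is only one plausible situation in which integer keys come back as strings, and that motivation is an assumption, not something this documentation states.

```python
import json


def keys_to_int(dct):
    """Return a copy of dct whose keys are converted with int()."""
    return {int(key): value for key, value in dct.items()}


# JSON object keys are always strings, so a round trip turns int keys into str.
restored = json.loads(json.dumps({0: "food", 1: "sports"}))
print(restored)
print(keys_to_int(restored))
```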
- pyss3.kmean_multilabel_size(res)
Use k-means to tell where to split the output of SS3.classify.
Given the output of SS3.classify (res), tell where to partition it into 2 clusters so that one of the clusters holds the category labels that the classifier should output when performing multi-label classification. To achieve this, implement k-means (i.e. 2-means) clustering over the category confidence values in res.
- Parameters:
res (list (of sorted pairs (category, confidence value))) – the classification output of SS3.classify
- Returns:
a positive integer indicating where to split res
- Return type:
int
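The idea of splitting a sorted classification result with 2-means can be sketched in a few lines of plain Python. This is not the library's code: two_means_split, its initialisation at the two extreme values, its tie-breaking rule and the sample result are all assumptions made for illustration.

```python
def two_means_split(values, max_iter=100):
    """Given values sorted in descending order, return the size of the top cluster."""
    if not values:
        return 0
    high, low = values[0], values[-1]  # start the two centroids at the extremes
    if high == low:
        return len(values)  # a single distinct value: nothing to separate
    split = 1
    for _ in range(max_iter):
        # Assignment step: a value joins the top cluster if it is closer to `high`.
        top = [v for v in values if abs(v - high) <= abs(v - low)]
        bottom = values[len(top):]  # valid because values are sorted
        # Update step: move each centroid to the mean of its cluster.
        new_high = sum(top) / len(top)
        new_low = sum(bottom) / len(bottom)
        split = len(top)
        if (new_high, new_low) == (high, low):
            break  # centroids stopped moving
        high, low = new_high, new_low
    return split


# Hypothetical sorted output: pairs (category, confidence value), best first.
res = [("food", 2.0), ("health", 1.75), ("sports", 0.25), ("music", 0.0)]
k = two_means_split([cv for _, cv in res])
labels = [cat for cat, _ in res[:k]]
print(k, labels)
```

The first k entries of res form the high-confidence cluster, which is the group of labels a multi-label classifier would output under this scheme.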
- pyss3.list_hash(str_list)
Return a hash value for a given list of strings.
- Parameters:
str_list (list (of str)) – a list of strings (e.g. x_test)
- Returns:
an MD5 hash value
- Return type:
str
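One way to obtain an MD5 value for a list of strings with the standard library is sketched below. How PySS3 actually combines the strings before hashing is not documented here, so the feeding order and the UTF-8 encoding are assumptions, and md5_of_strings is a hypothetical name.

```python
import hashlib


def md5_of_strings(str_list):
    """Return the hexadecimal MD5 digest of the strings, fed in list order."""
    digest = hashlib.md5()
    for text in str_list:
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()


x_test = ["first document", "second document"]
print(md5_of_strings(x_test))
```

Because the strings are fed one after another, this particular sketch cannot tell ["ab", "c"] apart from ["a", "bc"]; a separator between items would be needed if that distinction matters.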
- pyss3.mad(values, n)
Median absolute deviation mean.
- pyss3.re_split_keep(regex, string)
Force the inclusion of unmatched items by re.split.
This allows keeping the original content after splitting the input document for later use (e.g. for using it from the Live Test)
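A common way to keep everything when splitting with re.split is to wrap the pattern in a capturing group, which makes the matched delimiters part of the result. The sketch below shows that standard technique. It is an assumption that it resembles what re_split_keep does, and split_keep_delimiters is a hypothetical name.

```python
import re


def split_keep_delimiters(regex, string):
    """Split string on regex, returning the matched delimiters as items too."""
    # Wrapping the pattern in a capturing group makes re.split return the matches.
    return re.split("(" + regex + ")", string)


original = "roast  chicken recipe"
parts = split_keep_delimiters(r"\s+", original)
print(parts)
print("".join(parts) == original)
```

Since no characters are dropped, joining the parts reproduces the input exactly. A pattern that already contains capturing groups would add extra items to the result.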
- pyss3.set_verbosity(level)
Set the verbosity level.
0 (quiet): do not output any message (only error messages)
1 (normal): default behavior, display only warning messages and progress bars
2 (verbose): display also the informative non-essential messages
The following built-in constants can also be used to refer to these 3 values: VERBOSITY.QUIET, VERBOSITY.NORMAL, and VERBOSITY.VERBOSE, respectively.
For example, if you want PySS3 to hide everything, even progress bars, you could simply do:
>>> import pyss3
...
>>> pyss3.set_verbosity(0)
...
>>> # here's the rest of your code :D
or, equivalently:
>>> import pyss3
>>> from pyss3 import VERBOSITY
...
>>> pyss3.set_verbosity(VERBOSITY.QUIET)
...
>>> # here's the rest of your code :D
- Parameters:
level (int) – the verbosity level
- pyss3.sigmoid(v, l)
A sigmoid function.
- pyss3.vdiv(v0, v1)
Vectorial version of division.
- pyss3.vmax(v0, v1)
Vectorial version of max.
- pyss3.vsum(v0, v1)
Vectorial version of sum.
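The three vectorial helpers above presumably apply an operation position by position to two vectors of the same length. The sketch below shows that pattern with a single hypothetical helper, elementwise. It is not the library's implementation, and how the library handles division by zero or vectors of different lengths is not documented here.

```python
def elementwise(op, v0, v1):
    """Apply a binary operation to two equal-length vectors, position by position."""
    return [op(a, b) for a, b in zip(v0, v1)]


v0 = [1.0, 4.0, 0.5]
v1 = [2.0, 2.0, 0.5]
print(elementwise(lambda a, b: a + b, v0, v1))  # element-wise sum
print(elementwise(max, v0, v1))                 # element-wise max
print(elementwise(lambda a, b: a / b, v0, v1))  # element-wise division
```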
Submodules
pyss3.server module
SS3 classification server with visual explanations for live tests.
(Please, visit https://github.com/sergioburdisso/pyss3 for more info)
- class pyss3.server.Server
Bases: object
SS3’s Live Test HTTP server class.
- static get_port()
Return the server port.
- Returns:
the server port
- Return type:
int
- run(x_test=None, y_test=None, port=0, browser=True, quiet=True, prep=True, prep_func=None, def_cat=None)
Wait for classification requests and serve them.
- Parameters:
clf (pyss3.SS3) – the SS3 model to be attached to this server.
x_test (list (of str)) – the list of documents to classify and visualize
y_test (list (of str)) – the list of category labels
port (int) – the port to listen on (default: random free port)
browser (bool) – if True, it automatically opens up the live test on your browser
quiet (bool) – if True, use quiet mode. Otherwise use verbose mode (default: True)
prep (bool) – enables the default input preprocessing when classifying (default: True)
prep_func (function) – the custom preprocessing function to be applied to the given document before classifying it. If not given, the default preprocessing function will be used
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”, or “unknown” for multi-label classification)
- Raises:
ValueError
- static serve(clf=None, x_test=None, y_test=None, port=0, browser=True, quiet=True, prep=True, prep_func=None, def_cat=None)
Wait for classification requests and serve them.
- Parameters:
clf (pyss3.SS3) – the SS3 model to be attached to this server.
x_test (list (of str)) – the list of documents to classify and visualize
y_test (list (of str)) – the list of category labels
port (int) – the port to listen on (default: random free port)
browser (bool) – if True, it automatically opens up the live test on your browser
quiet (bool) – if True, use quiet mode. Otherwise use verbose mode (default: True)
prep (bool) – enables the default input preprocessing when classifying (default: True)
prep_func (function) – the custom preprocessing function to be applied to the given document before classifying it. If not given, the default preprocessing function will be used
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”, or “unknown” for multi-label classification)
- Raises:
ValueError
- static set_model(clf)
Attach a given SS3 model to this server.
- Parameters:
clf (pyss3.SS3) – an SS3 model
- static set_testset(x_test, y_test=None, def_cat=None)
Assign the test set to visualize.
- Parameters:
x_test (list (of str)) – the list of documents to classify and visualize
y_test (list (of str)) – the list of category labels
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”, or “unknown” for multi-label classification)
- Raises:
ValueError
- static set_testset_from_files(test_path, folder_label=True, sep_doc='\n')
Load the test set files to visualize from test_path.
- Parameters:
test_path (str) – the test set path
folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)
- Returns:
True if category documents were found, False otherwise
- Return type:
bool
- static set_testset_from_files_multilabel(docs_path, labels_path, sep_label=None, sep_doc='\n')
Multilabel version of the Live_Test.set_testset_from_files() function.
Load test documents and category labels from disk to visualize in the Live Test tool.
- Parameters:
docs_path (str) – the file or the folder containing the test documents.
labels_path (str) – the file containing the labels for each document.
* if docs_path is a file, then the labels_path file should contain a line with the corresponding list of category labels for each document in docs_path.
* if docs_path is a folder containing the documents, then the labels_path file should contain a line for each document and category label. Each line should have the following format: document_name<sep_label>label.
sep_label (str) – the separator/delimiter used to separate either each label (if docs_path is a file) or the document name from its category (if docs_path is a folder). (default: ';' when docs_path is a file, the '\s+' regular expression otherwise).
sep_doc (str) – the separator/delimiter used to separate each document when loading training/test documents from single file. Valid only when folder_label=False. (default: '\n')
- static start_listening(port=0)
Start listening on a port and return its number.
(If a port number is not given, it uses a random free port).
- Parameters:
port (int) – the port to listen on
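The "random free port" behaviour can be illustrated with the standard library alone. This is a minimal sketch and not PySS3's implementation: `listen_on` is a hypothetical helper, and it relies only on the general rule that binding a socket to port 0 lets the operating system choose a free port.

```python
import socket


def listen_on(port=0):
    """Bind a TCP socket and return (socket, actual_port).

    Hypothetical helper (not part of PySS3): port 0 asks the OS for a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", port))
    sock.listen(1)
    # getsockname() reports the port that was actually assigned
    return sock, sock.getsockname()[1]


sock, chosen_port = listen_on()
sock.close()
```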
- pyss3.server.content_type(ext)
Given a file extension, return the content type.
- pyss3.server.get_http_body(http_request)
Given an HTTP request, return the body.
- pyss3.server.get_http_contlength(http_request)
Given an HTTP request, return the Content-Length value.
- pyss3.server.get_http_path(http_request)
Given an HTTP request, return the resource path.
- pyss3.server.main()
The main function, called when the module is run from the command line.
- pyss3.server.parse_and_sanitize(rsc_path)
Very simple function to parse and sanitize the given path.
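What the HTTP helper functions above extract from a request can be sketched on a raw request string. The functions below are hypothetical stand-ins written for illustration (their names, the `/classify` path and the sample request are assumptions); the actual parsing done by `pyss3.server` is not shown in this reference.

```python
import re

# Sample request invented for this sketch
RAW_REQUEST = (
    "POST /classify HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Content-Length: 11\r\n"
    "\r\n"
    "hello world"
)


def request_path(http_request):
    """Return the resource path (second token of the request line)."""
    request_line = http_request.split("\r\n", 1)[0]
    return request_line.split(" ")[1]


def request_content_length(http_request):
    """Return the Content-Length header as an int (0 if absent)."""
    match = re.search(r"^Content-Length:\s*(\d+)", http_request,
                      flags=re.IGNORECASE | re.MULTILINE)
    return int(match.group(1)) if match else 0


def request_body(http_request):
    """Return everything after the blank line that ends the headers."""
    return http_request.split("\r\n\r\n", 1)[1]
```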
pyss3.cmd_line module
This module lets you interact with your SS3 models through a Command Line.
(Please, visit https://github.com/sergioburdisso/pyss3 for more info)
- exception pyss3.cmd_line.ArgsParseError
Bases: Exception
Exception thrown when an error occurs while parsing command arguments.
- exception pyss3.cmd_line.LoadDataError
Bases: Exception
Exception thrown when an error occurs while retrieving the test data.
- class pyss3.cmd_line.SS3Prompt(completekey='tab', stdin=None, stdout=None)
Bases: Cmd
Prompt main class.
- args_classify(args)
Parse classify arguments.
- args_evaluations(args)
Parse evaluations arguments.
- args_grid_search(args)
Parse grid_search arguments.
- args_k_fold(args)
Parse k_fold arguments.
- args_learn(args)
Parse learn arguments.
- args_live_test(args)
Parse live_test arguments.
- args_save(args)
Parse save arguments.
- args_set(args)
Parse set arguments.
- args_test(args)
Parse test arguments.
- args_train(args)
Parse train arguments.
- complete_evaluations(text, line, begidx, endidx)
Complete arguments for ‘evaluations’ command.
- complete_get(text, line, begidx, endidx)
Complete arguments for ‘get’ command.
- complete_grid_search(text, line, begidx, endidx)
Complete arguments for ‘grid_search’ command.
- complete_info(text, line, begidx, endidx)
Complete arguments for ‘info’ command.
- complete_k_fold(text, line, begidx, endidx)
Complete arguments for ‘k_fold’ command.
- complete_ld(text, line, begidx, endidx)
Complete arguments for ‘load’ command.
- complete_learn(text, line, begidx, endidx)
Complete arguments for ‘learn’ command.
- complete_live_test(text, line, begidx, endidx)
Complete arguments for ‘live_test’ command.
- complete_load(text, line, begidx, endidx)
Complete arguments for ‘load’ command.
- complete_plot(text, line, begidx, endidx)
Complete arguments for ‘plot’ command.
- complete_save(text, line, begidx, endidx)
Complete arguments for ‘save’ command.
- complete_set(text, line, begidx, endidx)
Complete arguments for ‘set’ command.
- complete_sv(text, line, begidx, endidx)
Complete arguments for ‘save’ command.
- complete_test(text, line, begidx, endidx)
Complete arguments for ‘test’ command.
- complete_train(text, line, begidx, endidx)
Complete arguments for ‘train’ command.
- default(line)
Default error message.
- do_EOF(args='')
Quit the program.
- do_classify(**kwargs)
Classify a document.
- usage:
classify [DOCUMENT_PATH]
- optional arguments:
DOCUMENT_PATH the path to the document file
- do_clone(**kwargs)
Create a copy of the current model with a given name.
- usage:
clone NEW_MODEL_NAME
- required arguments:
NEW_MODEL_NAME the new model’s name
- do_debug_term(**kwargs)
Show debugging information about a given n-gram.
Namely, print the n-gram frequency (fr), local value (lv), global value (gv), confidence value (cv), sanction (sn) weight and significance (sg) weight.
- usage:
debug_term N_GRAM
- required arguments:
N_GRAM the n-gram (word, bigram, trigram, etc.) to debug
- examples:
debug_term the
debug_term potato
debug_term “machine learning”
debug_term “self driving car”
- do_evaluations(**kwargs)
Perform different actions linked to evaluations results.
- usage:
evaluations OPTION [PATH] [METHOD] [DEF_CAT] [P VAL [P VAL …]]
- required arguments:
OPTION indicates the action to perform
values: {info,plot,save,remove} (default: info)
info - show information about evaluations (including best values).
plot - show an interactive 3-D plot with evaluation results in the web browser (it also saves it to disk).
save - save the interactive 3-D plot to disk.
remove - delete evaluation results from history
- optional arguments:
PATH the dataset path used in the evaluation of interest
METHOD the method that was used in the evaluation of interest
values: {test,K-fold} where K is an integer > 1
DEF_CAT default category used in the evaluation of interest
values: {most-probable,unknown} or a category label
P VAL the hyperparameter value (only for option “remove”)
P values: {s,l,p,a}
VAL values: float
- examples:
- show information about all evaluations:
evaluations info
- show information about evaluations in path “a/dataset/path”:
evaluations info a/dataset/path
- information about 3-fold evaluations in path “a/dataset/path”:
evaluations info a/dataset/path 3-fold
- information about test evaluations in path “a/dataset/path”:
evaluations info a/dataset/path test
- plot evaluations:
evaluations plot
- save evaluations:
evaluations save
- remove all evaluation result(s) in path “a/dataset/path”:
evaluations remove a/dataset/path
- remove 4-fold evaluation result(s) in path “a/dataset/path” with l = 1.1 and s = .45:
evaluations remove a/dataset/path 4-fold l 1.1 s .45
- do_exit(args='')
Quit the program.
- do_get(**kwargs)
Get a given hyperparameter value.
- usage:
get PARAM
- required arguments:
PARAM the hyperparameter name
values: {s,l,p,a}
- examples:
get s
get l
get p
get a
- do_grid_search(**kwargs)
Given a dataset, perform a grid search using the given hyperparameters values.
- usage:
grid_search PATH [LABEL] [DEF_CAT] [METHOD] P EXP [P EXP …] [no-cache]
- required arguments:
PATH the dataset path
P EXP a list of values for a given hyperparameter.
- where:
P is a hyperparameter name. values: {s,l,p,a}
EXP is a python expression returning a float or a list of floats. Note: if this expression contains whitespaces, use quotation marks (e.g. “[0.5, 1.5]”)
- examples:
s [.3,.4,.5]
s “[.3, .4, .5]” (Note the whitespaces and the “”)
p r(.2,.8,6) (i.e. 6 points between .2 and .8)
- optional arguments:
LABEL where to read category labels from.
values: {file,folder} (default: folder)
DEF_CAT default category to be assigned when the model is not able to actually classify a document.
values: {most-probable,unknown} or a category label (default: most-probable)
METHOD the method to be used
values: {test, K-fold} (default: test)
where: K-fold indicates the number of folds to be used. K is an integer > 1 (e.g. 4-fold, 10-fold, etc.)
no-cache if present, disable the cache and recompute all the values
- examples:
grid_search a/testset/path s r(.2,.8,6) l r(.1,2,6) -p r(.5,2,6) a [0,.01]
grid_search a/dataset/path 4-fold -s [.2,.3,.4,.5] -l [.5,1,1.5] -p r(.5,2,6)
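The `r(.2,.8,6)` expression is described above as "6 points between .2 and .8". A minimal sketch of such a helper follows, assuming the points are evenly spaced and include both endpoints; the `r` actually available inside the prompt is not shown in this reference, so treat this as an illustration only.

```python
def r(start, stop, num):
    """Return `num` evenly spaced values from `start` to `stop`, inclusive.

    Illustrative stand-in; assumes both endpoints are part of the grid.
    """
    if num < 2:
        return [float(start)]
    step = (stop - start) / (num - 1)
    # rounding hides floating-point noise such as 0.32000000000000006
    return [round(start + i * step, 10) for i in range(num)]


values = r(.2, .8, 6)
```

Under this assumption the grid for `p r(.2,.8,6)` would contain six candidate values, starting at .2 and ending at .8.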
- do_info(**kwargs)
Show useful information.
- usage:
info OPTION
- required arguments:
OPTION indicates what information to show
values: {all, parameters, categories, evaluations} (default: all)
- examples:
info
info evaluations
- do_k_fold(**kwargs)
Perform a stratified k-fold validation using the given dataset.
- usage:
k_fold PATH [LABEL] [DEF_CAT] [N-grams] [K-fold] [P VAL …] [no-cache]
- required arguments:
PATH the dataset path
- optional arguments:
LABEL where to read category labels from.
values: {file,folder} (default: folder)
DEF_CAT default category to be assigned when the model is not able to actually classify a document.
values: {most-probable,unknown} or a category label (default: most-probable)
N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams” only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).
value: {N-grams} with N integer > 0 (default: 1-grams)
K-fold indicates the number of folds to be used.
value: {K-fold} with K integer > 1 (default: 4-fold)
P VAL sets a hyperparameter value (e.g. s 0.45)
P values: {s,l,p,a}
VAL values: float
no-cache if present, disable the cache and recompute values
- examples:
k_fold a/dataset/path 10-fold
k_fold a/dataset/path 4-fold -s .45 -l 1.1 -p 1
- do_learn(**kwargs)
Learn a new document.
- usage:
learn CAT [N-grams] [DOCUMENT_PATH]
- required arguments:
CAT the category label
- optional arguments:
N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams” only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).
value: {N-grams} with N integer > 0 (default: 1-grams)
DOCUMENT_PATH the path to the document file
- do_license(args)
Print the license.
- do_live_test(**kwargs)
Interactively and graphically test the model.
- usage:
live_test [TEST_PATH [LABEL]] [verbose]
- optional arguments:
TEST_PATH the test set path
LABEL where to read category labels from.
values: {file,folder} (default: folder)
verbose if present, run in verbose mode
- examples:
live_test
live_test a/testset/path
live_test a/testset/path verbose
- do_load(**kwargs)
Load a local model (given its name).
- usage:
load MODEL_NAME
- required arguments:
MODEL_NAME the model’s name
- do_new(**kwargs)
Create a new empty SS3 model with a given name.
- usage:
new MODEL_NAME
- required arguments:
MODEL_NAME the model’s name
- do_next_word(**kwargs)
Show up to 3 possible words to follow after the given sentence.
- usage:
next_word SENT
- required arguments:
SENT a sentence
- examples:
next_word “the self driving”
next_word “a machine learning”
- do_plot(**kwargs)
Plot word value distribution curve or the evaluation results.
- usage:
plot OPTION
- required arguments:
OPTION indicates what to plot
- values:
evaluations;
distribution CAT;
- where:
CAT the category label
- examples:
plot distribution a_category
plot evaluations
- do_rename(**kwargs)
Rename the current model with a given name.
- usage:
rename NEW_MODEL_NAME
- required arguments:
NEW_MODEL_NAME the model’s new name
- do_save(**kwargs)
Save to disk the model, learned vocabulary, evaluations results, etc.
- usage:
save OPTION
- required arguments:
OPTION indicates what to save to disk
- values:
model; (default)
evaluations;
vocabulary [CAT];
stopwords [SG_THRESHOLD];
- where:
CAT the category label
SG_THRESHOLD significance (sg) value used as a threshold to consider words as stopwords (i.e. words with sg < sg_threshold for all categories will be considered as “stopwords”) (default: .01)
- examples:
save
save model
save vocabulary
save vocabulary a_category
save stopwords
save stopwords .1
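The stopword rule stated above (a word is a stopword when its sg value is below the threshold for all categories) can be sketched as follows. The `select_stopwords` function and the `{word: {category: sg}}` structure are hypothetical, chosen only for illustration; they do not reflect how the model stores its values internally.

```python
def select_stopwords(sg_by_word, sg_threshold=.01):
    """Return words whose sg value is below the threshold in every category.

    `sg_by_word` maps each word to {category: sg}; illustrative structure only.
    """
    return sorted(
        word for word, sg_per_cat in sg_by_word.items()
        if all(sg < sg_threshold for sg in sg_per_cat.values())
    )


# Toy significance values, invented for this sketch
toy_sg = {
    "the":    {"sports": .001, "tech": .002},
    "goal":   {"sports": .900, "tech": .003},
    "kernel": {"sports": .000, "tech": .850},
}
stopwords = select_stopwords(toy_sg)
```

Here only "the" qualifies, since "goal" and "kernel" are each significant for at least one category.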
- do_set(**kwargs)
Set a given hyperparameter value.
- usage:
set P VAL [P VAL …]
- required arguments:
P VAL sets a hyperparameter value
examples: s .45; s .5;
P values: {s,l,p,a}
VAL values: float
- examples:
set s .5
set l 0.5
set p 2
set s .5 l 0.5 p 2
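The `P VAL [P VAL …]` convention used by `set` (and by several other commands) pairs each hyperparameter name with a float. A minimal sketch of parsing such a token list is shown below; `parse_hparam_pairs` is a hypothetical helper and is not the module's own `parse_hparams_args` function, whose behaviour is not detailed in this reference.

```python
def parse_hparam_pairs(tokens):
    """Turn ['s', '.5', 'l', '0.5'] into {'s': 0.5, 'l': 0.5}."""
    if len(tokens) % 2:
        raise ValueError("expected P VAL pairs, got an odd number of tokens")
    hparams = {}
    # walk the tokens two at a time: name, value, name, value, ...
    for name, value in zip(tokens[0::2], tokens[1::2]):
        if name not in ("s", "l", "p", "a"):
            raise ValueError("unknown hyperparameter: %r" % name)
        hparams[name] = float(value)
    return hparams


parsed = parse_hparam_pairs("s .5 l 0.5 p 2".split())
```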
- do_test(**kwargs)
Test the model using the given test set.
- usage:
test TEST_PATH [LABEL] [DEF_CAT] [P VAL …] [no-cache]
- required arguments:
TEST_PATH the test set path
- optional arguments:
LABEL where to read category labels from.
values: {file,folder} (default: folder)
DEF_CAT default category to be assigned when the model is not able to actually classify a document.
values: {most-probable,unknown} or a category label (default: most-probable)
P VAL sets a hyperparameter value
examples: s .45; s .5;
P values: {s,l,p,a}
VAL values: float
no-cache if present, disable the cache and recompute values
- examples:
test a/testset/path
test a/testset/path -s .45 -l 1.1 -p 1
test a/testset/path unknown -s .45 -l 1.1 -p 1 no-cache
- do_train(**kwargs)
Train the model using a training set and then save it.
- usage:
train TRAIN_PATH [LABEL] [N-grams]
- required arguments:
TRAIN_PATH the training set path
- optional arguments:
LABEL where to read category labels from.
values: {file,folder} (default: folder)
N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams” only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).
value: {N-grams} with N integer > 0 (default: 1-grams)
- examples:
train a/training/set/path 3-grams
- do_update(**kwargs)
Update model values (cv, gv, lv, etc.).
- precmd(line)
Hook method executed just before the command.
- preloop()
Hook method executed once when cmdloop() is called.
- pyss3.cmd_line.evaluation_plot(open_browser=True)
Plot the interactive 3D evaluation plot.
- pyss3.cmd_line.evaluation_remove(data_path, method, def_cat, hparams)
Remove the evaluation results selected by the user.
- pyss3.cmd_line.intersect(l0, l1)
Given two lists return the intersection.
- pyss3.cmd_line.load_data(data_path, folder_label, cmd_name='test')
Load documents from disk, return the x_data, y_data and categories.
- pyss3.cmd_line.main()
Main function.
- pyss3.cmd_line.overwrite_model(model_path, model_name)
Remove both, the model and cache file.
- pyss3.cmd_line.parse_hparams_args(op_args, defaults=True)
Parse hyperparameters arguments list.
- pyss3.cmd_line.re_in(regex, l)
Given a list of strings, return the first match in the list.
- pyss3.cmd_line.requires_args(func)
A @decorator.
- pyss3.cmd_line.requires_model(func)
A @decorator.
- pyss3.cmd_line.split_args(args)
Parse and split arguments.
- pyss3.cmd_line.subtract(l0, l1)
Subtract list l1 from l0.
- pyss3.cmd_line.train(x_train, y_train, n_grams, train_path='', folder_label=None, save=True, leave_pbar=True)
Train a new model with the given training set.
pyss3.util module
This is a helper module with utility classes and functions.
- class pyss3.util.Dataset
Bases: object
A helper class with methods to read datasets from disk.
- static load_from_files(data_path, folder_label=True, as_single_doc=False, sep_doc='\n')
Load training/test documents and category labels from disk.
- Parameters:
data_path (str) – the training or the test set path
folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)
as_single_doc (bool) – read the documents as a single (and big) document (default: False)
sep_doc (str) – the separator/delimiter used to separate each document when loading training/test documents from single file. Valid only when folder_label=False. (default: '\n')
- Returns:
the (x_train, y_train) or (x_test, y_test) pairs.
- Return type:
tuple
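The two labelling modes described by `folder_label` can be pictured with a small standard-library reader. This is only a sketch under stated assumptions: `read_labelled_docs` is hypothetical, and the assumed layouts are one sub-folder per category with one file per document (`folder_label=True`), or one file per category whose documents are separated by `sep_doc` (`folder_label=False`).

```python
import os
import tempfile


def read_labelled_docs(data_path, folder_label=True, sep_doc="\n"):
    """Return (x_data, y_data) read from `data_path` (illustrative only)."""
    x_data, y_data = [], []
    for entry in sorted(os.listdir(data_path)):
        full = os.path.join(data_path, entry)
        if folder_label and os.path.isdir(full):
            # assumed layout: one sub-folder per category, one file per document
            for name in sorted(os.listdir(full)):
                with open(os.path.join(full, name), encoding="utf-8") as f:
                    x_data.append(f.read())
                y_data.append(entry)
        elif not folder_label and os.path.isfile(full):
            # assumed layout: one file per category, documents split by sep_doc
            label = os.path.splitext(entry)[0]
            with open(full, encoding="utf-8") as f:
                docs = [d for d in f.read().split(sep_doc) if d.strip()]
            x_data.extend(docs)
            y_data.extend([label] * len(docs))
    return x_data, y_data


# Build a tiny throwaway dataset to exercise the folder_label=False branch
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sports.txt"), "w", encoding="utf-8") as f:
        f.write("the match ended in a draw\nthe striker scored twice\n")
    with open(os.path.join(root, "tech.txt"), "w", encoding="utf-8") as f:
        f.write("the kernel was patched\n")
    x_data, y_data = read_labelled_docs(root, folder_label=False)
```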
- static load_from_files_multilabel(docs_path, labels_path, sep_label=None, sep_doc='\n')
Multilabel version of the Dataset.load_from_files() function.
Load training/test documents and category labels from disk.
- Parameters:
docs_path (str) – the file or the folder containing the training/test documents.
labels_path (str) – the file containing the labels for each document.
* if docs_path is a file, then the labels_path file should contain a line with the corresponding list of category labels for each document in docs_path. For instance, if sep_doc='\n' and the content of docs_path is:
this is document 1
this is document 2
this is document 3
then, if sep_label=';', the labels_path file should contain the labels for each document (in order) separated by ;, as follows:
labelA;labelB
labelA
labelB;labelC
* if docs_path is a folder containing the documents, then the labels_path file should contain a line for each document and category label. Each line should have the following format: document_name<the sep_label>label. For instance, if the docs_path folder contains the following 3 documents:
doc1.txt
doc2.txt
doc3.txt
then, following the above example, the labels_path file should be:
doc1 labelA
doc1 labelB
doc2 labelA
doc3 labelB
doc3 labelC
sep_label (str) – the separator/delimiter used to separate either each label (if docs_path is a file) or the document name from its category (if docs_path is a folder). (default: ';' when docs_path is a file, the '\s+' regular expression otherwise).
sep_doc (str) – the separator/delimiter used to separate each document when loading training/test documents from single file. Valid only when folder_label=False. (default: '\n')
- Returns:
the (x_train, y_train) or (x_test, y_test) pairs.
- Return type:
tuple
- Raises:
ValueError
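The two labels_path layouts described above can be parsed with a few lines of standard-library code. The sketch below uses hypothetical helper names and the label examples from the parameter description; it illustrates the file formats only and is not the library's loader.

```python
import re
from collections import defaultdict


def labels_for_single_file(labels_text, sep_label=";"):
    """docs_path is a file: one line of labels per document, in order."""
    return [line.split(sep_label) for line in labels_text.splitlines() if line]


def labels_for_folder(labels_text, sep_label=r"\s+"):
    """docs_path is a folder: one 'document_name<sep_label>label' per line."""
    labels = defaultdict(list)
    for line in labels_text.splitlines():
        if line.strip():
            doc_name, label = re.split(sep_label, line.strip(), maxsplit=1)
            labels[doc_name].append(label)
    return dict(labels)


by_position = labels_for_single_file("labelA;labelB\nlabelA\nlabelB;labelC\n")
by_name = labels_for_folder(
    "doc1 labelA\ndoc1 labelB\ndoc2 labelA\ndoc3 labelB\ndoc3 labelC\n")
```

Both calls describe the same labelling: the first document has labels A and B, the second only A, and the third B and C.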
- static load_from_url(zip_url, inner_path=None, folder_label=True, as_single_doc=False, sep_doc='\n')
Load training/test documents and category labels from the given url.
This method downloads and extracts the zip file (given by the zip_url url) into the system’s temporary folder and then calls Dataset.load_from_files().
- Parameters:
zip_url (str) – the url to the zipped dataset
inner_path (str) – the path within the zip file to be used
folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)
as_single_doc (bool) – read the documents as a single (and big) document (default: False)
sep_doc (str) – the separator/delimiter used to separate each document when loading training/test documents from single file. Valid only when folder_label=False. (default: '\n')
- Returns:
the (x_train, y_train) or (x_test, y_test) pairs.
- Return type:
tuple
- Raises:
FileNotFoundError
- static load_from_url_multilabel(zip_url, labels_path, inner_path=None, sep_label=None, sep_doc='\n')
Load training/test multilabel documents from the given url.
This method downloads and extracts the zip file (given by the zip_url url) into the system’s temporary folder and then calls Dataset.load_from_files_multilabel().
- Parameters:
zip_url (str) – the url to the zipped dataset
labels_path (str) – the file containing the labels for each document. (please see Dataset.load_from_files_multilabel() documentation for more info)
inner_path (str) – the path within the zip file to be used
sep_label (str) – the separator/delimiter used to separate either each label (if docs_path is a file) or the document name from its category (if docs_path is a folder). (default: ';' when docs_path is a file, the '\s+' regular expression otherwise).
sep_doc (str) – the separator/delimiter used to separate each document when loading training/test documents from single file. Valid only when folder_label=False. (default: '\n')
- Returns:
the (x_train, y_train) or (x_test, y_test) pairs.
- Return type:
tuple
- Raises:
FileNotFoundError, ValueError
- class pyss3.util.Evaluation
Bases: object
Evaluation class.
This class provides the user with easy-to-use methods for model evaluation and hyperparameter optimization, such as the Evaluation.test, Evaluation.kfold_cross_validation, Evaluation.grid_search, and Evaluation.plot methods for performing tests, stratified k-fold cross validations, grid searches for hyperparameter optimization, and visualizing evaluation results using an interactive 3D plot, respectively.
All the evaluation methods provided by this class internally use a cache mechanism in which all previously computed evaluations are permanently stored for later use. This prevents the user from wasting time performing the same evaluation more than once. Likewise, if the computer crashes or loses power during a long evaluation, once relaunched the evaluation will skip previously computed values and continue from the point where the crash happened.
Usage:
>>> from pyss3.util import Evaluation
For usage examples of the previously mentioned methods, read their documentation, or do one of the tutorials.
- static clear_cache(clf=None)
Wipe out the evaluation cache (for the given classifier).
- Parameters:
clf (SS3) – the classifier (optional)
- static get_best_hyperparameters(metric=None, metric_target='macro avg', tag=None, method=None, def_cat=None)
Return the best hyperparameter values for the given metric.
From all the evaluations performed using the given method, default category (def_cat) and cache tag, this method returns the hyperparameter values that performed the best, according to the given metric. If these arguments are not supplied, the ones matching the last performed evaluation are used automatically.
Available metrics are: ‘accuracy’, ‘f1-score’, ‘precision’, and ‘recall’. In addition, in multi-label classification, ‘hamming-loss’ and ‘exact-match’ are also available.
Except for accuracy, a metric_target option must also be supplied along with the metric, indicating the target we aim at measuring, that is, whether we want to measure some averaging performance or the performance on a particular category.
- Parameters:
metric (str) – the evaluation metric to return, options are: ‘accuracy’, ‘f1-score’, ‘precision’, or ‘recall’. When working with multi-label classification problems, two more options are allowed: ‘hamming-loss’ and ‘exact-match’. Note: exact match will produce the same result as ‘accuracy’. (default: ‘accuracy’, or ‘hamming-loss’ for multi-label case).
metric_target (str) – the target we aim at measuring with the given metric. Options are: ‘macro avg’, ‘micro avg’, ‘weighted avg’ or a category label (default ‘macro avg’).
tag (str) – the cache tag from where to look up the results (by default it will automatically use the tag of the last evaluation performed)
method (str) – the evaluation method used, options are: ‘test’, ‘K-fold’, where K is a positive integer (by default it will match the method of the last evaluation performed).
def_cat (str) – the default category used in the evaluations, options are: ‘most-probable’, ‘unknown’ or a category label (by default it will use the same as the last evaluation performed).
- Returns:
a tuple of the hyperparameter values: (s, l, p, a).
- Return type:
tuple
- Raises:
ValueError, LookupError, KeyError
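The role of `metric` and `metric_target` in selecting the best hyperparameters can be sketched over a toy results table. Everything below is an assumption made for illustration: the `best_hyperparameters` function, the `{(s, l, p, a): report}` structure and the scores are invented, and the sketch treats higher values as better (which would not hold for a loss such as ‘hamming-loss’).

```python
def best_hyperparameters(results, metric="accuracy", target="macro avg"):
    """Return the (s, l, p, a) key with the highest value for the metric.

    `results` maps (s, l, p, a) to a report dict; hypothetical structure.
    """
    def score(report):
        if metric == "accuracy":
            return report["accuracy"]      # accuracy needs no target
        return report[target][metric]      # e.g. report["macro avg"]["f1-score"]
    return max(results, key=lambda hparams: score(results[hparams]))


# Toy evaluation results, invented for this sketch
toy_results = {
    (.3, 1, 1, 0): {"accuracy": .81, "macro avg": {"f1-score": .79}},
    (.4, 1, 1, 0): {"accuracy": .84, "macro avg": {"f1-score": .78}},
    (.5, 1, 1, 0): {"accuracy": .80, "macro avg": {"f1-score": .82}},
}
best_acc = best_hyperparameters(toy_results)
best_f1 = best_hyperparameters(toy_results, metric="f1-score")
```

The example shows why the metric matters: the combination with the best accuracy is not the one with the best macro-averaged F1 score.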
- static grid_search(clf, x_data, y_data, s=None, l=None, p=None, a=None, k_fold=None, n_grams=None, def_cat='most-probable', prep=True, tag=None, metric=None, metric_target='macro avg', cache=True, extended_pbar=False)
Perform a grid search using the provided hyperparameter values.
Given a test or a training set, this method performs a grid search using the given lists of hyperparameter values. Once finished, it returns the best hyperparameter values found for the given metric.
If the argument k_fold is provided, the grid search will perform a stratified k-fold cross validation for each hyperparameter value combination. If k_fold is not given, it will use x_data as if it were a test set (x_test) and will use this test set to evaluate the classifier performance for each hyperparameter value combination.
Examples:
>>> from pyss3.util import Evaluation
>>> ...
>>> best_s, _, _, _ = Evaluation.grid_search(clf, x_test, y_test, s=[.3, .4, .5])
>>> print("For this test set, the value of s that obtained the "
>>>       "best accuracy, among .3, .4, and .5, was:", best_s)
>>> ...
>>> s, l, p, _ = Evaluation.grid_search(clf,
>>>                                     x_test, y_test,
>>>                                     s=[.3, .4, .5],
>>>                                     l=[.5, 1, 1.5],
>>>                                     p=[.5, 1, 2])
>>> print("For this test set and these hyperparameter values, "
>>>       "the value of s, l and p that obtained the best accuracy were, "
>>>       "respectively:", s, l, p)
>>> ...
>>> # since this grid search performs the same computation as the one above,
>>> # cached values will be used instead of computing everything all over again.
>>> s, l, p, _ = Evaluation.grid_search(clf,
>>>                                     x_test, y_test,
>>>                                     s=[.3, .4, .5],
>>>                                     l=[.5, 1, 1.5],
>>>                                     p=[.5, 1, 2],
>>>                                     metric="f1-score")
>>> print("For this test set and these hyperparameter values, "
>>>       "the value of s, l and p that obtained the best F1 score were, "
>>>       "respectively:", s, l, p)
>>> ...
>>> s, l, p, _ = Evaluation.grid_search(clf,
>>>                                     x_train, y_train,
>>>                                     s=[.3, .4, .5],
>>>                                     l=[.5, 1, 1.5],
>>>                                     p=[.5, 1, 2],
>>>                                     k_fold=4)
>>> print("For this training set and these hyperparameter values, "
>>>       "and using stratified 4-fold cross validation, "
>>>       "the value of s, l and p that obtained the best accuracy were, "
>>>       "respectively:", s, l, p)
- Parameters:
clf (SS3) – the classifier to be evaluated.
x_data (list (of str)) – a list of documents
y_data (list (of str)) – a list of document category labels
s – the list of values for the s hyperparameter (optional). If not given, will take the classifier (clf) current value.
l – the list of values for the l hyperparameter (optional). If not given, will take the classifier (clf) current value.
p – the list of values for the p hyperparameter (optional). If not given, will take the classifier (clf) current value.
a – the list of values for the a hyperparameter (optional). If not given, will take the classifier (clf) current value.
k_fold (int) – indicates the number of folds to be used (optional). If not given, it will perform the grid search using x_data as the test set.
n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are ‘most-probable’, ‘unknown’ or a given category name. (default: ‘most-probable’)
prep (bool) – enables the default input preprocessing when classifying (default: True)
tag (str) – the cache tag to be used, i.e. a string to identify this evaluation inside the cache storage (optional)
metric (str) – the evaluation metric to return, options are: ‘accuracy’, ‘f1-score’, ‘precision’, or ‘recall’. When working with multi-label classification problems, two more options are allowed: ‘hamming-loss’ and ‘exact-match’. Note: exact match will produce the same result as ‘accuracy’. (default: ‘accuracy’, or ‘hamming-loss’ for multi-label case).
metric_target (str) – the target we aim at measuring with the given metric. Options are: ‘macro avg’, ‘micro avg’, ‘weighted avg’ or a category label (default ‘macro avg’).
cache (bool) – whether to use cached values or not. Setting cache=False forces the evaluation to be performed from scratch, ignoring cached values (default: True).
extended_pbar (bool) – whether to show an extra status bar along with the progress bar (default: False).
- Returns:
a tuple of hyperparameter values (s, l, p, a) with the best values for the given metric
- Return type:
tuple
- Raises:
InvalidCategoryError, EmptyModelError, ValueError, TypeError
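The combination of an exhaustive search with a cache, as described for this method, can be sketched independently of the classifier. In the sketch below, `toy_grid_search`, `toy_score` and the plain dict used as cache are all hypothetical stand-ins; the real method evaluates an SS3 model and persists its cache to disk.

```python
from itertools import product


def toy_grid_search(evaluate, s, l, p, a, cache=None):
    """Evaluate every (s, l, p, a) combination and return the best one.

    `evaluate` is any callable returning a score (higher is better);
    `cache` is a plain dict standing in for the persistent cache.
    """
    cache = {} if cache is None else cache
    for hparams in product(s, l, p, a):
        if hparams not in cache:           # skip already computed values
            cache[hparams] = evaluate(*hparams)
    return max(product(s, l, p, a), key=cache.get)


calls = []  # records every real evaluation, to show the effect of the cache


def toy_score(s, l, p, a):
    calls.append((s, l, p, a))
    return -abs(s - .4) - abs(l - 1) - abs(p - 2) - a


shared_cache = {}
grid = dict(s=[.3, .4, .5], l=[.5, 1, 1.5], p=[.5, 1, 2], a=[0])
best = toy_grid_search(toy_score, cache=shared_cache, **grid)
best_again = toy_grid_search(toy_score, cache=shared_cache, **grid)
```

The second call returns the same answer without evaluating anything again, because all 27 combinations are already in the cache.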
- static kfold_cross_validation(clf, x_train, y_train, k=4, n_grams=None, def_cat='most-probable', prep=True, tag=None, plot=True, metric=None, metric_target='macro avg', cache=True)
Perform a Stratified k-fold cross validation on the given training set.
Examples:
>>> from pyss3.util import Evaluation
>>> ...
>>> acc = Evaluation.kfold_cross_validation(clf, x_train, y_train)
>>> print("Accuracy obtained using (default) 4-fold cross validation:", acc)
>>> ...
>>> # this line won't perform the cross validation again, it will retrieve, from the
>>> # cache storage, the f1-score value computed in previous evaluation
>>> f1 = Evaluation.kfold_cross_validation(clf, x_train, y_train, metric="f1-score")
>>> print("F1 score obtained using (default) 4-fold cross validation:", f1)
>>> ...
>>> f1 = Evaluation.kfold_cross_validation(clf, x_train, y_train, k=10, metric="f1-score")
>>> print("F1 score obtained using 10-fold cross validation:", f1)
- Parameters:
clf (SS3) – the classifier to be evaluated.
x_train (list (of str)) – the list of documents
y_train (list (of str)) – the list of document category labels
k (int) – indicates the number of folds to be used (default: 4).
n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are ‘most-probable’, ‘unknown’ or a given category name. (default: ‘most-probable’)
prep (bool) – enables the default input preprocessing when classifying (default: True)
tag (str) – the cache tag to be used, i.e. a string to identify this evaluation inside the cache storage (optional)
plot (bool) – whether to plot the confusion matrix after finishing the test or not (default: True)
metric (str) – the evaluation metric to return, options are: ‘accuracy’, ‘f1-score’, ‘precision’, or ‘recall’. When working with multi-label classification problems, two more options are allowed: ‘hamming-loss’ and ‘exact-match’. Note: exact match will produce the same result as ‘accuracy’. (default: ‘accuracy’, or ‘hamming-loss’ for multi-label case).
metric_target (str) – the target we aim at measuring with the given metric. Options are: ‘macro avg’, ‘micro avg’, ‘weighted avg’ or a category label (default ‘macro avg’).
cache (bool) – whether to use cached values or not. Setting cache=False forces the evaluation to be performed from scratch, ignoring cached values (default: True).
- Returns:
the given metric value, by default, the obtained accuracy.
- Return type:
float
- Raises:
InvalidCategoryError, EmptyModelError, ValueError, KeyError
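"Stratified" means that every fold keeps roughly the same category proportions as the whole training set. The sketch below shows one simple way to build such folds (round-robin assignment within each category); `stratified_folds` is a hypothetical helper, and the splitting strategy used internally by this method may differ.

```python
from collections import defaultdict


def stratified_folds(y_data, k=4):
    """Assign each document index to one of k folds, category by category."""
    by_category = defaultdict(list)
    for index, label in enumerate(y_data):
        by_category[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_category.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # round-robin keeps proportions
    return folds


# 8 "pos" and 4 "neg" documents: each of the 4 folds gets 2 "pos" and 1 "neg"
y_toy = ["pos"] * 8 + ["neg"] * 4
folds = stratified_folds(y_toy, k=4)
```

In each of the k rounds, one fold would be held out for testing while the remaining folds are used for training.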
- static plot(html_path='./', open_browser=True)
Open up an interactive 3D plot with the obtained results.
This 3D plot is opened up in the web browser and shows the results obtained from all the performed evaluations up to date. In addition, before showing the plot in the browser, this method also creates a portable HTML file containing the 3D plot.
- Parameters:
html_path (str) – the path in which to store the portable HTML file (default: ‘./’)
open_browser (bool) – whether to open the HTML in the browser or not (default: True)
- Raises:
ValueError
- static remove(s=None, l=None, p=None, a=None, method=None, def_cat=None, tag=None, simulate=False)
Remove evaluation results from the cache storage.
If no arguments are given, this method will remove everything, and thus it will perform exactly like the clear_cache method. However, when arguments are given, only values matching those argument values will be removed. For example:
>>> # remove evaluation results using
>>> # s=.5, l=1, and p=1 hyperparameter values:
>>> Evaluation.remove(s=.5, l=1, p=1)
>>>
>>> # remove all 4-fold cross validation evaluations:
>>> Evaluation.remove(method="4-fold")
Besides, this method returns the number of items that were removed. If the argument simulate is set to True, items won’t be removed and only the number of items to be removed will be returned. For example:
>>> c, _ = Evaluation.remove(s=.45, method="test", simulate=True)
>>> if input("%d items will be removed, proceed? (y/n)" % c) == 'y':
>>>     Evaluation.remove(s=.45, method="test")
Here is the full list of arguments that can be used to select what is permanently removed from the cache storage:
- Parameters:
s (float) – a value for the s hyperparameter (optional).
l (float) – a value for the l hyperparameter (optional).
p (float) – a value for the p hyperparameter (optional).
a (float) – a value for the a hyperparameter (optional).
method (str) – an evaluation method, options are: ‘test’, ‘K-fold’, where K is a positive integer (optional).
def_cat (str) – a default category used in the evaluations, options are: ‘most-probable’, ‘unknown’ or a category label (optional).
metric – an evaluation metric, options are: ‘accuracy’, ‘f1-score’, ‘precision’, and ‘recall’ (optional).
tag (str) – a cache tag (optional).
simulate (bool) – whether to simulate the removal or not (default: False)
- Returns:
(number of items removed, details)
- Return type:
tuple
- Raises:
ValueError, TypeError
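The matching and simulate behaviour described above can be illustrated with a small stand-alone sketch. It does not touch PySS3's cache storage, whose format is not described here; the cache list of dicts and the remove_matching helper are hypothetical names introduced only for illustration.

```python
def remove_matching(cache, simulate=False, **filters):
    """Remove entries whose fields equal every given filter value.

    `cache` is a hypothetical list of dicts, not PySS3's real storage.
    Filters set to None are ignored, mirroring the optional arguments.
    Returns (number of matching entries, list of matching entries).
    """
    active = {k: v for k, v in filters.items() if v is not None}
    matched = [e for e in cache
               if all(e.get(k) == v for k, v in active.items())]
    if not simulate:
        # keep only the entries that did not match
        cache[:] = [e for e in cache if e not in matched]
    return len(matched), matched


cache = [
    {"s": 0.5, "l": 1, "p": 1, "method": "test"},
    {"s": 0.45, "l": 1, "p": 1, "method": "test"},
    {"s": 0.45, "l": 2, "p": 1, "method": "4-fold"},
]

# dry run: the match is counted but nothing is deleted
count, _ = remove_matching(cache, s=0.45, method="test", simulate=True)
print(count, len(cache))

# real run: the single matching entry is dropped
count, _ = remove_matching(cache, s=0.45, method="test")
print(count, len(cache))
```

Calling the helper with no filters matches every entry, which corresponds to the "remove everything" case mentioned above.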
- static set_classifier(clf)
Set the classifier to be evaluated.
- Parameters:
clf (SS3) – the classifier
- static show_best(tag=None, method=None, def_cat=None, metric=None, avg=None)
Print information regarding the best obtained values according to all the metrics.
The information shown can be filtered using any of the following arguments:
- Parameters:
tag (str) – a cache tag (optional).
method (str) – an evaluation method, options are: ‘test’, ‘K-fold’, where K is a positive integer (optional).
def_cat (str) – a default category used in the evaluations, options are: ‘most-probable’, ‘unknown’ or a category label (optional).
metric (str) – an evaluation metric, options are: ‘accuracy’, ‘f1-score’, ‘precision’, and ‘recall’. In multi-label classification, ‘hamming-loss’ and ‘exact-match’ are also allowed (optional).
avg (str) – an averaging method, options are: ‘macro avg’, ‘micro avg’, and ‘weighted avg’ (optional).
- Raises:
ValueError
- static test(clf, x_test, y_test, def_cat='most-probable', prep=True, tag=None, plot=True, metric=None, metric_target='macro avg', cache=True)
Test the model using the given test set.
Examples:
>>> from pyss3.util import Evaluation
>>> ...
>>> acc = Evaluation.test(clf, x_test, y_test)
>>> print("Accuracy:", acc)
>>> ...
>>> # this line won't perform the test again; it will retrieve, from the cache storage,
>>> # the f1-score value computed in the previous test.
>>> f1 = Evaluation.test(clf, x_test, y_test, metric="f1-score")
>>> print("F1 score:", f1)
- Parameters:
clf (SS3) – the classifier to be evaluated.
x_test (list (of str)) – the test set documents, i.e., the list of documents to be classified
y_test (list (of str) or list (of list of str)) – the test set category labels, i.e., the list of document labels
def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are ‘most-probable’, ‘unknown’ or a given category name. (default: ‘most-probable’)
prep (bool) – enables the default input preprocessing when classifying (default: True)
tag (str) – the cache tag to be used, i.e. a string to identify this evaluation inside the cache storage (optional)
plot (bool) – whether to plot the confusion matrix after finishing the test or not (default: True)
metric (str) – the evaluation metric to return, options are: ‘accuracy’, ‘f1-score’, ‘precision’, or ‘recall’. When working with multi-label classification problems, two more options are allowed: ‘hamming-loss’ and ‘exact-match’. Note: ‘exact-match’ will produce the same result as ‘accuracy’. (default: ‘accuracy’, or ‘hamming-loss’ for the multi-label case).
metric_target (str) – the target we aim at measuring with the given metric. Options are: ‘macro avg’, ‘micro avg’, ‘weighted avg’ or a category label (default ‘macro avg’).
cache (bool) – whether to use cached values or not. Setting cache=False forces the evaluation to be performed in full, ignoring cached values (default: True).
- Returns:
the given metric value, by default, the obtained accuracy.
- Return type:
float
- Raises:
EmptyModelError, KeyError, ValueError
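The metric_target options can be made concrete with a small sketch. It assumes that ‘macro avg’, ‘micro avg’ and ‘weighted avg’ carry their usual meaning for single-label classification; it is written in plain Python, does not call PySS3, and the precision_averages helper is a hypothetical name.

```python
from collections import Counter


def precision_averages(y_true, y_pred):
    """Return per-label precision plus macro, micro and weighted averages."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)   # true positives + false positives per label
    support = Counter(y_true)     # number of true documents per label

    per_label = {
        c: tp[c] / predicted[c] if predicted[c] else 0.0 for c in labels
    }
    # macro: every label counts the same, regardless of its size
    macro = sum(per_label.values()) / len(labels)
    # micro: pool all decisions (equals accuracy in the single-label case)
    micro = sum(tp.values()) / len(y_pred)
    # weighted: per-label values weighted by the label's support
    weighted = sum(per_label[c] * support[c] for c in labels) / len(y_true)
    return per_label, macro, micro, weighted


y_true = ["sports", "sports", "sports", "sports", "food"]
y_pred = ["sports", "sports", "sports", "food", "food"]
per_label, macro, micro, weighted = precision_averages(y_true, y_pred)
print(per_label)
print(macro, micro, weighted)
```

On this toy data the three averages differ (0.75, 0.8 and 0.9), which is why the choice of metric_target matters when categories are imbalanced.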
- class pyss3.util.Preproc
Bases: object
A helper class with methods for pre-processing documents.
- static clean_and_ready(text, dots=True, normalize=True, min_len=1)
Clean and prepare the text.
- class pyss3.util.Print
Bases: object
Helper class to handle print functionalities.
- static error(msg='', raises=None, offset=0, decorator=True)
Print an error.
- Parameters:
msg (str) – the message to show
raises (Exception) – the exception to be raised after showing the message
offset (int) – shift the message to the right (offset characters)
decorator (bool) – if True, use the error message decorator
- static get_verbosity()
Return the verbosity level.
0 (quiet): do not output any message (only error messages)
1 (normal): default behavior, display only warning messages and progress bars
2 (verbose): also display the informative non-essential messages
- Returns:
the verbosity level
- Return type:
int
- static info(msg='', newln=True, offset=0, decorator=True, force_show=False)
Print an info message.
- Parameters:
msg (str) – the message to show
newln (bool) – use new line after the message (default: True)
offset (int) – shift the message to the right (offset characters)
decorator (bool) – if True, use the info message decorator
force_show (bool) – if True, show message even when not in verbose mode
- static is_quiet()
Check if the current verbosity level is quiet.
- static is_verbose()
Check if the current verbosity level is verbose.
- static set_decorator_error(start, end=None)
Set error messages decorator.
- Parameters:
start (str) – message prefix
end (str) – message suffix
- static set_decorator_info(start, end=None)
Set info messages decorator.
- Parameters:
start (str) – message prefix
end (str) – message suffix
- static set_decorator_warn(start, end=None)
Set warning messages decorator.
- Parameters:
start (str) – message prefix
end (str) – message suffix
- static set_verbosity(level)
Set the verbosity level.
0 (quiet): do not output any message (only error messages)
1 (normal): default behavior, display only warning messages and progress bars
2 (verbose): also display the informative non-essential messages
The following built-in constants can also be used to refer to these 3 values: VERBOSITY.QUIET, VERBOSITY.NORMAL, and VERBOSITY.VERBOSE, respectively.
For example, if you want PySS3 to hide everything, even progress bars, you could do:
>>> from pyss3.util import Print, VERBOSITY
...
>>> Print.set_verbosity(VERBOSITY.QUIET)  # or, equivalently, Print.set_verbosity(0)
...
>>> # here's the rest of your code :D
- Parameters:
level (int) – the verbosity level
- static show(msg='', newln=True, offset=0, force_show=False)
Print a message.
- Parameters:
msg (str) – the message to show
newln (bool) – use new line after the message (default: True)
offset (int) – shift the message to the right (offset characters)
- static verbosity_region_begin(level, force=False)
Indicate that a region with different verbosity begins.
When the region ends by calling verbosity_region_end, the previous verbosity will be restored.
Example:
>>> from pyss3.util import Print, VERBOSITY
...
>>> Print.verbosity_region_begin(VERBOSITY.QUIET)
>>> # inside this region (from now on), verbosity will be 'quiet'
...
>>> Print.verbosity_region_end()
>>> # the verbosity level is restored to what it was before entering the region
- Parameters:
level (int) – the verbosity level for this region (see the set_verbosity documentation for valid values)
- static verbosity_region_end()
Indicate that a region with different verbosity ends.
The verbosity will be restored to the value it had before beginning this region with verbosity_region_begin.
Example:
>>> from pyss3.util import Print, VERBOSITY
...
>>> Print.verbosity_region_begin(VERBOSITY.VERBOSE)
>>> # inside this region (from now on), verbosity will be 'verbose'
...
>>> Print.verbosity_region_end()
>>> # the verbosity level is restored to what it was before entering the region
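Because a region must always be closed, the begin/end pair lends itself to a context manager. The sketch below is one possible wrapper, not part of PySS3. To stay runnable without the library it uses a small stand-in class, PrintStandIn, that imitates only the documented behaviour; the stack used to restore levels is an assumption about how nested regions might be handled.

```python
from contextlib import contextmanager


class PrintStandIn:
    """Stand-in imitating the documented region semantics (not PySS3 code)."""
    _level = 1    # plays the role of VERBOSITY.NORMAL
    _saved = []   # levels to restore, innermost last (an assumption)

    @staticmethod
    def get_verbosity():
        return PrintStandIn._level

    @staticmethod
    def verbosity_region_begin(level):
        PrintStandIn._saved.append(PrintStandIn._level)
        PrintStandIn._level = level

    @staticmethod
    def verbosity_region_end():
        if PrintStandIn._saved:
            PrintStandIn._level = PrintStandIn._saved.pop()


@contextmanager
def verbosity_region(level, printer=PrintStandIn):
    """Run a block at `level`, restoring the old level even on error."""
    printer.verbosity_region_begin(level)
    try:
        yield
    finally:
        printer.verbosity_region_end()


QUIET = 0
with verbosity_region(QUIET):
    inside = PrintStandIn.get_verbosity()
print(inside, PrintStandIn.get_verbosity())
```

The wrapper only calls verbosity_region_begin and verbosity_region_end, so passing the real Print class from pyss3.util as printer should behave the same way, although that has not been verified here.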
- static warn(msg='', newln=True, raises=None, offset=0, decorator=True)
Print a warning message.
- Parameters:
msg (str) – the message to show
newln (bool) – use new line after the message (default: True)
raises (Exception) – the exception to be raised after showing the message
offset (int) – shift the message to the right (offset characters)
decorator (bool) – if True, use the warning message decorator
- class pyss3.util.RecursiveDefaultDict
Bases: dict
A dict whose default value is a dict.
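A dict with this behaviour can be sketched with the __missing__ hook. The NestedDict class below is a hypothetical illustration of the idea and may differ from how RecursiveDefaultDict is actually implemented.

```python
class NestedDict(dict):
    """Hypothetical dict whose missing keys default to a new nested dict."""

    def __missing__(self, key):
        # dict.__getitem__ calls this hook when `key` is absent
        value = self[key] = type(self)()
        return value


counts = NestedDict()
# intermediate levels are created on first access
counts["sports"]["goal"]["freq"] = 3
print(counts)
```

Note that merely reading a missing key also creates it, so membership should be checked with the in operator rather than by indexing.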
- class pyss3.util.Style
Bases: object
Helper class to handle print styles.
- static blue(text)
Apply ‘blue’ style to text.
- static bold(text)
Apply bold style to text.
- static fail(text)
Apply the ‘fail’ style to text.
- static green(text)
Apply ‘green’ style to text.
- static header(text)
Apply ‘header’ style to text.
- static ubold(text)
Apply underline and bold style to text.
- static underline(text)
Underline text.
- static warning(text)
Apply the ‘warning’ style to text.
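Terminal styles of this kind are commonly implemented with ANSI escape sequences. The sketch below assumes that approach; the CODES table and the apply_style helper are hypothetical and are not taken from the Style class, whose actual codes are not documented here.

```python
RESET = "\033[0m"   # ANSI code that clears all styling
CODES = {
    "bold": "\033[1m",
    "underline": "\033[4m",
    "blue": "\033[94m",
    "green": "\033[92m",
}


def apply_style(text, *styles):
    """Wrap `text` in the escape codes for `styles`, then reset."""
    prefix = "".join(CODES[s] for s in styles)
    return prefix + str(text) + RESET


print(apply_style("done", "green"))
print(apply_style("warning", "underline", "bold"))
```

Combining two codes, as in the second call, is one way an "underline and bold" style like ubold could be produced.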
- class pyss3.util.VERBOSITY
Bases: object
Verbosity “enum” constants.
- NORMAL = 1
- QUIET = 0
- VERBOSE = 2
- pyss3.util.is_a_collection(o)
Return True when the object o is a collection.
- pyss3.util.list_by_force(v)
Convert any non-iterable object into a list.
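The two helpers above are typically used to accept either a single value or a list of values as an argument. The sketch below shows one plausible reading of that pattern. The names is_collection and as_list are hypothetical, and treating strings as single values is an assumption, since the documentation does not say how the real helpers handle them.

```python
def is_collection(o):
    """True for containers of items; strings and bytes count as single values."""
    if isinstance(o, (str, bytes)):
        return False
    try:
        iter(o)
    except TypeError:
        return False
    return True


def as_list(v):
    """Return `v` unchanged if it is a collection, else wrap it in a list."""
    return v if is_collection(v) else [v]


print(as_list("sports"))
print(as_list(["sports", "food"]))
print(as_list(3))
```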
- pyss3.util.membership_matrix(clf, y_data, labels=True, show_pbar=True)
Transform a list of (multiple) labels into a “membership matrix”.
The membership matrix is built by converting each list of category labels (i.e., each y in y_data) into a vector with a fixed position for each learned category, holding the value 1 for each label in y and 0 otherwise.
When working with multi-label classification problems, this representation enables measuring the performance using common evaluation metrics such as Hamming loss, exact match ratio, accuracy, precision, recall, F1, etc.
For instance, suppose y_data = [[], ['labelA'], ['labelB'], ['labelA', 'labelC']] and that the classifier clf has been trained on 3 categories whose labels are ‘labelA’, ‘labelB’, and ‘labelC’. Then the call:
>>> membership_matrix(clf, [[], ['labelA'], ['labelB'], ['labelA', 'labelC']])
returns the following membership matrix:
>>> [[0, 0, 0],  # []
>>>  [1, 0, 0],  # ['labelA']
>>>  [0, 1, 0],  # ['labelB']
>>>  [1, 0, 1]]  # ['labelA', 'labelC']
- Parameters:
clf (SS3) – the trained classifier
y_data (list of list of str) – the list of document labels
labels (bool) – whether the y_data list contains category labels or category indexes (default: True)
show_pbar (bool) – whether to show the progress bar or not (default: True)
- Returns:
a (sparse) matrix in which each row is the membership vector of the corresponding element (list of labels) in y_data.
- Return type:
scipy.sparse.lil.lil_matrix
- Raises:
ValueError
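To show how the membership representation feeds the multi-label metrics, here is a minimal plain-Python sketch. It builds dense lists rather than the sparse matrix returned by membership_matrix, takes the category list directly instead of a trained classifier, and all three helper names are hypothetical.

```python
def membership_rows(categories, y_data):
    """Build membership vectors as plain lists (a dense stand-in)."""
    index = {label: i for i, label in enumerate(categories)}
    rows = []
    for labels in y_data:
        row = [0] * len(categories)
        for label in labels:
            # this sketch raises KeyError for an unknown label
            row[index[label]] = 1
        rows.append(row)
    return rows


def hamming_loss(true_rows, pred_rows):
    """Fraction of label positions that differ."""
    wrong = sum(t != p for tr, pr in zip(true_rows, pred_rows)
                for t, p in zip(tr, pr))
    return wrong / (len(true_rows) * len(true_rows[0]))


def exact_match(true_rows, pred_rows):
    """Fraction of documents whose whole label set is predicted exactly."""
    return sum(t == p for t, p in zip(true_rows, pred_rows)) / len(true_rows)


cats = ["labelA", "labelB", "labelC"]
y_true = membership_rows(cats, [[], ["labelA"], ["labelB"], ["labelA", "labelC"]])
y_pred = membership_rows(cats, [[], ["labelA"], ["labelC"], ["labelA"]])
print(y_true)
print(hamming_loss(y_true, y_pred), exact_match(y_true, y_pred))
```

Here 3 of the 12 label positions are wrong, giving a Hamming loss of 0.25, while only 2 of the 4 documents are predicted exactly, giving an exact match ratio of 0.5. The two metrics penalise partial errors differently.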
- pyss3.util.round_fix(v, precision=4)
Round the number v (used to keep the results history file small).