PySS3 Package¶
Main module¶
This is the main module containing the implementation of the SS3 classifier.
(Please, visit https://github.com/sergioburdisso/pyss3 for more info)
-
exception
pyss3.
EmptyModelError
(msg='')¶ Bases:
Exception
Exception to be thrown when the model is empty.
-
exception
pyss3.
InvalidCategoryError
(msg='')¶ Bases:
Exception
Exception to be thrown when a category is not valid.
-
class
pyss3.
SS3
(s=None, l=None, p=None, a=None, name='', cv_m='norm_gv_xai', sn_m='xai')¶ Bases:
object
The SS3 classifier class.
The SS3 classifier was originally defined in Section 3 of https://dx.doi.org/10.1016/j.eswa.2019.05.023 (preprint available here: https://arxiv.org/abs/1905.08772)
Parameters: - s (float) – the “smoothness” (sigma) hyperparameter value
- l (float) – the “significance” (lambda) hyperparameter value
- p (float) – the “sanction” (rho) hyperparameter value
- a (float) – the alpha hyperparameter value (i.e. all terms with a confidence value (cv) less than alpha will be ignored during classification)
- name (str) – the model’s name (to save and load the model from disk)
- cv_m (str) – method used to compute the confidence value (cv) of each term (word or n-grams), options are: “norm_gv_xai”, “norm_gv” and “gv” (default: “norm_gv_xai”)
- sn_m (str) – method used to compute the sanction (sn) function, options are: “vanilla” and “xai” (default: “xai”)
-
classify
(doc, prep=True, sort=True, json=False)¶ Classify a given document.
Parameters: - doc (str) – the content of the document
- prep (bool) – enables input preprocessing (default: True)
- sort (bool) – sort the classification result (from best to worst)
- json (bool) – return a debugging version of the result in JSON format.
Returns: the document confidence vector if sort is False. If sort is True, a list of pairs (category index, confidence value) ordered by confidence value.
Return type: list
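The two return shapes described above can be illustrated with a minimal sketch in plain Python. It does not call PySS3; the helper name `to_sorted_pairs` and the sample vector are hypothetical and only show how a confidence vector relates to the sorted list of pairs.

```python
def to_sorted_pairs(confidence_vector):
    """Turn a confidence vector into (category index, confidence value)
    pairs ordered from the highest to the lowest confidence value."""
    pairs = list(enumerate(confidence_vector))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical confidence vector for three categories (indexes 0, 1, 2).
confidence_vector = [0.1, 2.5, 0.7]
ranked = to_sorted_pairs(confidence_vector)
print(ranked)
```

With sort=False the caller would see something shaped like `confidence_vector`; with sort=True, something shaped like `ranked`.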
-
classify_label
(doc, def_cat='most-probable', labels=True, prep=True)¶ Classify a given document returning the category label.
Parameters: - doc (str) – the content of the document
- def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”)
- labels (bool) – whether to return the category label or just the category index (default: True)
- prep (bool) – enables input preprocessing (default: True)
Returns: the category label or the category index.
Return type: str or int
Raises: InvalidCategoryError
-
classify_multilabel
(doc, def_cat='most-probable', labels=True, prep=True)¶ Classify a given document returning multiple category labels.
This method can be used to perform multi-label classification. Internally, it uses k-means clustering on the confidence vector to select the proper group of labels.
Parameters: - doc (str) – the content of the document
- def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name. (default: “most-probable”)
- labels (bool) – whether to return the category labels or just the category indexes (default: True)
- prep (bool) – enables input preprocessing (default: True)
Returns: the list of category labels (or indexes).
Return type: list (of str or int)
Raises: InvalidCategoryError
-
cv
(ngram, cat)¶ Return the “confidence value” of a given word n-gram for the given category.
This value is obtained by applying a final transformation to the global value of the given word n-gram, as computed by the gv function [*].
This transformation is chosen when creating a new SS3 instance (see the cv_m argument of the SS3 class constructor for more information).
[*] the gv function is defined in Section 3.2.2 of the original paper: https://arxiv.org/pdf/1905.08772.pdf
Example:
>>> clf.cv("chicken", "food")
>>> clf.cv("roast chicken", "food")
>>> clf.cv("chicken", "sports")
Parameters: - ngram (str) – the word or word n-gram
- cat (str) – the category label
Returns: the confidence value
Return type: float
Raises: InvalidCategoryError
-
extract_insight
(doc, cat='auto', level='word', window_size=3, min_cv=0.01, sort=True)¶ Get the list of text blocks involved in the classification decision.
Given a document, return the pieces of text that were involved in the classification decision, along with the confidence values associated with them. If a category is given, perform the process as if the given category were the one assigned by the classifier.
Parameters: - doc (str) – the content of the document
- cat (str) – the category in relation to which text blocks are obtained. If not present, it will automatically use the category assigned by SS3 after classification. Options are ‘auto’ or a given category name. (default: ‘auto’)
- level (str) – the level at which text blocks are going to be extracted. Options are ‘word’, ‘sentence’ or ‘paragraph’. (default: ‘word’)
- window_size (int) – the number of words, before and after each identified word, to be included along with the identified word. For instance, window_size=0 means return only individual words; window_size=1 means also include the word before and the word after each of them. If multiple selected words are close enough for their word windows to overlap, those windows will be merged into a single, longer one. This argument is ignored when level is not equal to ‘word’. (default: 3)
- min_cv (float) – the minimum confidence value each text block must have to be included in the output. (default: 0.01)
- sort (bool) – whether to return the text blocks ordered by their confidence value. If sort=False, blocks will be returned in the order they appear in the input document. (default: True)
Returns: a list of pairs (text, confidence value) containing the text (blocks) involved, and to what degree (*), in the classification decision. (*) given by the confidence value
Return type: list
Raises: InvalidCategoryError, ValueError
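The window merging described for window_size can be sketched as follows. This is an illustration under stated assumptions, not PySS3's implementation: `merge_windows` is a hypothetical helper, and two windows are treated as overlapping when they share at least one word.

```python
def merge_windows(selected, window_size, n_words):
    """Return merged (start, end) word-index ranges (inclusive) built
    around each selected word position."""
    ranges = []
    for i in sorted(selected):
        start = max(0, i - window_size)
        end = min(n_words - 1, i + window_size)
        if ranges and start <= ranges[-1][1]:
            # This window overlaps the previous one: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


words = "the chef served roast chicken with a rich creamy sauce at the party".split()
selected = [4, 9]  # positions of "chicken" and "sauce"

for size in (0, 1, 3):
    blocks = [" ".join(words[s:e + 1]) for s, e in merge_windows(selected, size, len(words))]
    print(size, blocks)
```

With a window of 0 or 1 the two words stay in separate blocks; with a window of 3 their windows overlap and collapse into one block.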
-
fit
(x_train, y_train, n_grams=1, prep=True, leave_pbar=True)¶ Train the model given a list of documents and category labels.
Parameters: - x_train (list (of str)) – the list of documents
- y_train (list (of str)) – the list of document labels
- n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
- prep (bool) – enables input preprocessing (default: True)
- leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.
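To make the meaning of n_grams concrete, the sketch below enumerates every word n-gram up to a maximum length. It is plain Python and independent of PySS3; `word_ngrams` is a hypothetical helper, and how the library actually stores n-grams is not shown here.

```python
def word_ngrams(words, max_n):
    """List every word n-gram of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


words = ["machine", "learning", "rocks"]
print(word_ngrams(words, 1))  # only 1-grams (words)
print(word_ngrams(words, 2))  # 1-grams and 2-grams
```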
-
get_a
()¶ Get the alpha hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_alpha
()¶ Get the alpha hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_categories
()¶ Get the list of category names.
Returns: the list of category names Return type: list (of str)
-
get_category_index
(name)¶ Given its name, return the category index.
Parameters: name (str) – the category name
Returns: the category index (or IDX_UNKNOWN_CATEGORY if the category doesn’t exist).
Return type: int
-
get_category_name
(index)¶ Given its index, return the category name.
Parameters: index (int) – the category index
Returns: the category name (or STR_UNKNOWN_CATEGORY if the category doesn’t exist).
Return type: str
-
get_hyperparameters
()¶ Get hyperparameter values.
Returns: a tuple with hyperparameters current values (s, l, p, a) Return type: tuple
-
get_l
()¶ Get the “significance” (lambda) hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_most_probable_category
()¶ Get the name of the most probable category.
Returns: the name of the most probable category Return type: str
-
get_name
()¶ Return the model’s name.
Returns: the model’s name. Return type: str
-
get_next_words
(sent, cat, n=None)¶ Given a sentence, return the list of n (possible) following words.
Parameters: - sent (str) – a sentence (e.g. “an artificial”)
- cat (str) – the category name
- n (int) – the maximum number of possible answers
Returns: a list of tuples (word, frequency, probability)
Return type: list (of tuple)
Raises: InvalidCategoryError
-
get_p
()¶ Get the “sanction” (rho) hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_s
()¶ Get the “smoothness” (sigma) hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_sanction
()¶ Get the “sanction” (rho) hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_significance
()¶ Get the “significance” (lambda) hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_smoothness
()¶ Get the “smoothness” (sigma) hyperparameter value.
Returns: the hyperparameter value Return type: float
-
get_stopwords
(sg_threshold=0.01)¶ Get the list of (recognized) stopwords.
Parameters: sg_threshold (float) – significance (sg) value used as a threshold to consider words as stopwords (i.e. words with sg < sg_threshold for all categories will be considered as “stopwords”)
Returns: a list of stopwords
Return type: list (of str)
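The threshold rule can be written out as a small sketch. The `sg_by_word` table and `find_stopwords` helper are hypothetical stand-ins (the real values would come from a trained model); only the "sg < sg_threshold for all categories" rule is taken from the description above.

```python
def find_stopwords(sg_by_word, sg_threshold=0.01):
    """Return the words whose significance (sg) is below the threshold
    in every category."""
    return [
        word
        for word, sg_values in sg_by_word.items()
        if all(sg < sg_threshold for sg in sg_values)
    ]


# Hypothetical sg values, one per category, for three words.
sg_by_word = {
    "the": [0.001, 0.002],
    "chicken": [0.9, 0.003],  # significant for the first category
    "of": [0.0, 0.005],
}
stopwords = find_stopwords(sg_by_word)
print(stopwords)
```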
-
get_word
(index)¶ Given the index, return the word.
Parameters: index (int) – the word index
Returns: the word (or STR_UNKNOWN_WORD if the word doesn’t exist).
Return type: str
-
get_word_index
(word)¶ Given a word, return its index.
Parameters: word (str) – a word
Returns: the word index (or IDX_UNKNOWN_WORD if the word doesn’t exist).
Return type: int
-
gv
(ngram, cat)¶ Return the “global value” of a given word n-gram for the given category.
(The gv function is defined in Section 3.2.2 of the original paper: https://arxiv.org/pdf/1905.08772.pdf)
Example:
>>> clf.gv("chicken", "food")
>>> clf.gv("roast chicken", "food")
>>> clf.gv("chicken", "sports")
Parameters: - ngram (str) – the word or word n-gram
- cat (str) – the category label
Returns: the global value
Return type: float
Raises: InvalidCategoryError
-
learn
(doc, cat, n_grams=1, prep=True, update=True)¶ Learn a new document for a given category.
Parameters: - doc (str) – the content of the document
- cat (str) – the category name
- n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words), 2 means 1-grams and 2-grams, 3 means 1-grams, 2-grams and 3-grams, and so on).
- prep (bool) – enables input preprocessing (default: True)
- update (bool) – enables model auto-update after learning (default: True)
-
load_model
(path=None)¶ Load model from disk.
If a path is not given, the default will be used (“./”). However, if a path is given, it will not only be used to load the model but will also overwrite the default path by calling the SS3’s set_model_path(path) method (see the set_model_path method documentation for more detail).
Parameters: path (str) – the path to load the model from
Raises: OSError
-
lv
(ngram, cat)¶ Return the “local value” of a given word n-gram for the given category.
(The lv function is defined in Section 3.2.2 of the original paper: https://arxiv.org/pdf/1905.08772.pdf)
Example:
>>> clf.lv("chicken", "food")
>>> clf.lv("roast chicken", "food")
>>> clf.lv("chicken", "sports")
Parameters: - ngram (str) – the word or word n-gram
- cat (str) – the category label
Returns: the local value
Return type: float
Raises: InvalidCategoryError
-
plot_value_distribution
(cat)¶ Plot the category’s global and local value distribution.
Parameters: cat (str) – the category name Raises: InvalidCategoryError
-
predict
(x_test, def_cat='most-probable', labels=True, multilabel=False, prep=True, leave_pbar=True)¶ Classify a list of documents.
Parameters: - x_test (list (of str)) – the list of documents to be classified
- def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name.
- labels (bool) – whether to return the list of category names or just category indexes
- multilabel (bool) – whether to perform multi-label classification or not. If enabled, a list of labels is returned for each document instead of a single label (str).
- prep (bool) – enables input preprocessing (default: True)
- leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.
Returns: if labels is True, the list of category names; otherwise, the list of category indexes.
Return type: list (of int or str)
Raises: EmptyModelError, InvalidCategoryError
-
predict_proba
(x_test, prep=True, leave_pbar=True)¶ Classify a list of documents returning a list of confidence vectors.
Parameters: - x_test (list (of str)) – the list of documents to be classified
- prep (bool) – enables input preprocessing (default: True)
- leave_pbar (bool) – controls whether to leave the progress bar after finishing or remove it.
Returns: the list of confidence vectors
Return type: list (of list of float)
Raises: EmptyModelError
-
print_categories_info
()¶ Print information about learned categories.
-
print_hyperparameters_info
()¶ Print information about hyperparameters.
-
print_model_info
()¶ Print information regarding the model.
-
print_ngram_info
(ngram)¶ Print debugging information about a given n-gram.
Namely, print the n-gram frequency (fr), local value (lv), global value (gv), confidence value (cv), sanction (sn) weight, significance (sg) weight.
Parameters: ngram (str) – the n-gram (e.g. “machine”, “machine learning”, etc.)
-
save_cat_vocab
(cat, path='./', n_grams=-1)¶ Save category vocabulary to disk.
Parameters: - cat (str) – the category name
- path (str) – the path in which to store the vocabulary
- n_grams (int) – indicates the n-grams to be stored (e.g. only 1-grams, 2-grams, 3-grams, etc.). Default -1 stores all learned n-grams (1-grams, 2-grams, 3-grams, etc.)
Raises: InvalidCategoryError
-
save_model
(path=None)¶ Save the model to disk.
If a path is not given, the default will be used (“./”). However, if a path is given, it will not only be used to save the model but will also overwrite the default path by calling the SS3’s set_model_path(path) method (see the set_model_path method documentation for more detail).
Parameters: path (str) – the path to save the model to
Raises: OSError
-
save_vocab
(path='./', n_grams=-1)¶ Save learned vocabularies to disk.
Parameters: - path (str) – the path in which to store the vocabularies
- n_grams (int) – indicates the n-grams to be stored (e.g. only 1-grams, 2-grams, 3-grams, etc.). Default -1 stores all learned n-grams (1-grams, 2-grams, 3-grams, etc.)
-
set_a
(value)¶ Set the alpha hyperparameter value.
All terms with a confidence value (cv) less than alpha will be ignored during classification.
Parameters: value (float) – the hyperparameter value
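As a rough illustration of what the alpha threshold does, the sketch below discards terms whose confidence value falls below it. The `term_cvs` dictionary and `drop_below_alpha` helper are hypothetical; the actual classifier applies this rule internally during classification.

```python
def drop_below_alpha(term_cvs, alpha):
    """Keep only the terms whose confidence value is at least alpha."""
    return {term: cv for term, cv in term_cvs.items() if cv >= alpha}


# Hypothetical confidence values of three terms for one category.
term_cvs = {"chicken": 0.82, "the": 0.004, "recipe": 0.35}
kept = drop_below_alpha(term_cvs, alpha=0.01)
print(kept)
```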
-
set_alpha
(value)¶ Set the alpha hyperparameter value.
All terms with a confidence value (cv) less than alpha will be ignored during classification.
Parameters: value (float) – the hyperparameter value
-
set_block_delimiters
(parag=None, sent=None, word=None)¶ Overwrite the default delimiters used to split input documents into blocks.
Delimiters can be any regular expression, from simple ones (e.g. " ") to more complex ones (e.g. r"[^\s\w\d]"). Note: remember that certain characters are reserved in regular expressions, for example the dot (.); in those cases use a backslash to indicate you are referring to the character itself and not its special meaning (e.g. \.).
For example:
>>> ss3.set_block_delimiters(word=r"\s")
>>> ss3.set_block_delimiters(word=r"\s", parag="\n\n")
>>> ss3.set_block_delimiters(parag="\n---\n")
>>> ss3.set_block_delimiters(sent=r"\.")
>>> ss3.set_block_delimiters(word=r"\|")
>>> ss3.set_block_delimiters(word=" ")
Parameters: - parag (str) – the paragraph new delimiter
- sent (str) – the sentence new delimiter
- word (str) – the word new delimiter
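The effect of such delimiters, and of forgetting to escape a reserved character, can be checked with the standard `re` module alone. This sketch only demonstrates regular-expression splitting; whether PySS3 splits blocks with `re.split` exactly this way is an assumption.

```python
import re

doc = "First paragraph. Still first.\n\nSecond paragraph."

# Paragraph, sentence and word delimiters expressed as regular expressions.
paragraphs = re.split(r"\n\n", doc)
sentences = re.split(r"\.", paragraphs[0])
words = re.split(r"\s", "roast chicken recipe")

# Pitfall: an unescaped dot matches ANY character, so every character
# of the input becomes a delimiter and only empty strings remain.
unescaped = re.split(".", "a.b")

print(paragraphs)
print(sentences)
print(words)
print(unescaped)
```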
-
set_delimiter_paragraph
(regex)¶ Set the delimiter used to split documents into paragraphs.
Remember that certain characters are reserved in regular expressions, for example the dot (.); in those cases use a backslash to indicate you are referring to the character itself and not its special meaning (e.g. \.).
Parameters: regex (str) – the regular expression of the new delimiter
-
set_delimiter_sentence
(regex)¶ Set the delimiter used to split documents into sentences.
Remember that certain characters are reserved in regular expressions, for example the dot (.); in those cases use a backslash to indicate you are referring to the character itself and not its special meaning (e.g. \.).
Parameters: regex (str) – the regular expression of the new delimiter
-
set_delimiter_word
(regex)¶ Set the delimiter used to split documents into words.
Remember that certain characters are reserved in regular expressions, for example the dot (.); in those cases use a backslash to indicate you are referring to the character itself and not its special meaning (e.g. \.).
Parameters: regex (str) – the regular expression of the new delimiter
-
set_hyperparameters
(s=None, l=None, p=None, a=None)¶ Set hyperparameter values.
Parameters: - s (float) – the “smoothness” (sigma) hyperparameter
- l (float) – the “significance” (lambda) hyperparameter
- p (float) – the “sanction” (rho) hyperparameter
- a (float) – the alpha hyperparameter (i.e. all terms with a confidence value (cv) less than alpha will be ignored during classification)
-
set_l
(value)¶ Set the “significance” (lambda) hyperparameter value.
Parameters: value (float) – the hyperparameter value
-
set_model_path
(path)¶ Overwrite the default path from which the model will be loaded (or saved to).
Note: be aware that the PySS3 Command Line tool looks for a local folder called ss3_models to load models. Therefore, the ss3_models folder will always be automatically appended to the given path (e.g. if path="my/path/", it will be converted into my/path/ss3_models).
Parameters: path (str) – the path
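The path conversion in the note can be mimicked with `os.path.join`. This is only a sketch of the documented behaviour; `resolve_model_dir` is a hypothetical helper, not a PySS3 function.

```python
import os


def resolve_model_dir(path, folder="ss3_models"):
    """Append the models folder name to the given base path."""
    return os.path.join(path, folder)


model_dir = resolve_model_dir("my/path/")
print(model_dir)
```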
-
set_p
(value)¶ Set the “sanction” (rho) hyperparameter value.
Parameters: value (float) – the hyperparameter value
-
set_s
(value)¶ Set the “smoothness” (sigma) hyperparameter value.
Parameters: value (float) – the hyperparameter value
-
set_sanction
(value)¶ Set the “sanction” (rho) hyperparameter value.
Parameters: value (float) – the hyperparameter value
-
set_significance
(value)¶ Set the “significance” (lambda) hyperparameter value.
Parameters: value (float) – the hyperparameter value
-
set_smoothness
(value)¶ Set the “smoothness” (sigma) hyperparameter value.
Parameters: value (float) – the hyperparameter value
-
sg
(ngram, cat)¶ Return the “significance factor” of a given word n-gram for the given category.
(The sg function is defined in Section 3.2.2 of the original paper: https://arxiv.org/pdf/1905.08772.pdf)
Example:
>>> clf.sg("chicken", "food")
>>> clf.sg("roast chicken", "food")
>>> clf.sg("chicken", "sports")
Parameters: - ngram (str) – the word or word n-gram
- cat (str) – the category label
Returns: the significance factor
Return type: float
Raises: InvalidCategoryError
-
sn
(ngram, cat)¶ Return the “sanction factor” of a given word n-gram for the given category.
(The sn function is defined in Section 3.2.2 of the original paper: https://arxiv.org/pdf/1905.08772.pdf)
Example:
>>> clf.sn("chicken", "food")
>>> clf.sn("roast chicken", "food")
>>> clf.sn("chicken", "sports")
Parameters: - ngram (str) – the word or word n-gram
- cat (str) – the category label
Returns: the sanction factor
Return type: float
Raises: InvalidCategoryError
-
summary_op_ngrams
(cvs)¶ Summary operator for n-gram confidence vectors.
By default it returns the sum of all confidence vectors. However, if you want to use a custom summary operator, this function must be replaced as shown in the following example:
>>> def my_summary_op(cvs):
>>>     return cvs[0]
>>> ...
>>> clf = SS3()
>>> ...
>>> clf.summary_op_ngrams = my_summary_op
Note that any function that receives a list of vectors and returns a single vector can be used. In the example above, the summary operator is replaced by the user-defined my_summary_op, which ignores all confidence vectors except that of the first n-gram (this is purely illustrative and makes no practical sense).
Parameters: cvs (list (of list of float)) – a list of n-gram confidence vectors
Returns: a sentence confidence vector
Return type: list (of float)
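For a slightly more meaningful operator than returning the first vector, the sketch below contrasts an element-wise sum (the documented default behaviour) with an element-wise maximum. Both functions are hypothetical illustrations written in plain Python; they are not taken from the PySS3 source.

```python
def sum_summary_op(cvs):
    """Element-wise sum of a list of equal-length confidence vectors."""
    return [sum(column) for column in zip(*cvs)]


def max_summary_op(cvs):
    """Element-wise maximum: each category keeps its strongest vote."""
    return [max(column) for column in zip(*cvs)]


# Two hypothetical n-gram confidence vectors over three categories.
cvs = [[0.25, 0.0, 1.0], [0.5, 0.125, 0.0]]
print(sum_summary_op(cvs))
print(max_summary_op(cvs))
```

Either function has the required shape (a list of vectors in, a single vector out), so it could be assigned in the same way as in the example above.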
-
summary_op_paragraphs
(cvs)¶ Summary operator for paragraph confidence vectors.
By default it returns the sum of all confidence vectors. However, if you want to use a custom summary operator, this function must be replaced as shown in the following example:
>>> def dummy_summary_op(cvs):
>>>     return cvs[0]
>>> ...
>>> clf = SS3()
>>> ...
>>> clf.summary_op_paragraphs = dummy_summary_op
Note that any function that receives a list of vectors and returns a single vector can be used. In the example above, the summary operator is replaced by the user-defined dummy_summary_op, which ignores all confidence vectors except that of the first paragraph (this is purely illustrative and makes no practical sense).
Parameters: cvs (list (of list of float)) – a list of paragraph confidence vectors
Returns: the document confidence vector
Return type: list (of float)
-
summary_op_sentences
(cvs)¶ Summary operator for sentence confidence vectors.
By default it returns the sum of all confidence vectors. However, if you want to use a custom summary operator, this function must be replaced as shown in the following example:
>>> def dummy_summary_op(cvs):
>>>     return cvs[0]
>>> ...
>>> clf = SS3()
>>> ...
>>> clf.summary_op_sentences = dummy_summary_op
Note that any function that receives a list of vectors and returns a single vector can be used. In the example above, the summary operator is replaced by the user-defined dummy_summary_op, which ignores all confidence vectors except that of the first sentence (this is purely illustrative and makes no practical sense).
Parameters: cvs (list (of list of float)) – a list of sentence confidence vectors
Returns: a paragraph confidence vector
Return type: list (of float)
-
update_values
(force=False)¶ Update model values (cv, gv, lv, etc.).
Parameters: force (bool) – force update (even if hyperparameters haven’t changed)
-
pyss3.
key_as_int
(dct)¶ Cast the given dictionary (numerical) keys to int.
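One situation where such a cast is useful is after a JSON round trip, because JSON object keys are always strings. The sketch below shows the idea with a hypothetical `keys_to_int` helper; whether this is the reason PySS3 provides key_as_int is an assumption.

```python
import json


def keys_to_int(dct):
    """Return a copy of the dictionary with its keys cast to int."""
    return {int(key): value for key, value in dct.items()}


original = {0: "food", 1: "sports"}
loaded = json.loads(json.dumps(original))  # keys come back as "0", "1"
restored = keys_to_int(loaded)
print(loaded)
print(restored)
```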
-
pyss3.
kmean_multilabel_size
(res)¶ Use k-means to tell where to split the SS3.classify’s output.
Given a SS3.classify’s output (res), tell where to partition it into 2 clusters so that one of the clusters holds the category labels that the classifier should output when performing multi-label classification. To achieve this, it applies k-means (i.e. 2-means) clustering over the category confidence values in res.
Parameters: res (list (of sorted pairs (category, confidence value))) – the classification output of SS3.classify
Returns: a positive integer indicating where to split res
Return type: int
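The idea of splitting a sorted list of confidence values with 2-means can be sketched in a few lines. This is a minimal one-dimensional illustration under stated assumptions (centroids initialised at the two extremes, ties assigned to the high cluster); `two_means_split` is hypothetical and is not the library's code.

```python
def two_means_split(values, max_iter=100):
    """Given confidence values sorted from highest to lowest, return how
    many leading values fall in the high-confidence cluster."""
    if len(values) < 2:
        return len(values)
    hi, lo = values[0], values[-1]  # initial centroids: the two extremes
    split = 1
    for _ in range(max_iter):
        # Values closer to the high centroid form a prefix of the list.
        split = max(1, sum(1 for v in values if abs(v - hi) <= abs(v - lo)))
        top, rest = values[:split], values[split:]
        new_hi = sum(top) / len(top)
        new_lo = sum(rest) / len(rest) if rest else lo
        if (new_hi, new_lo) == (hi, lo):
            break  # centroids stopped moving
        hi, lo = new_hi, new_lo
    return split


# Hypothetical sorted output: (category index, confidence value) pairs.
res = [(2, 4.0), (0, 3.5), (3, 0.5), (1, 0.25), (4, 0.0)]
size = two_means_split([cv for _, cv in res])
print(res[:size])
```

Here the first two categories stand clearly apart from the rest, so they would be the ones returned as labels.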
-
pyss3.
mad
(values, n)¶ Median absolute deviation mean.
-
pyss3.
re_split_keep
(regex, string)¶ Force the inclusion of unmatched items by re.split.
This allows keeping the original content after splitting the input document for later use (e.g. for using it from the Live Test)
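One common way to split text without losing the separators is to wrap the delimiter pattern in a capturing group, since `re.split` then returns the matched delimiters too. The sketch below assumes this is the kind of behaviour meant here; `split_keep` is a hypothetical helper and assumes the given pattern has no capturing groups of its own.

```python
import re


def split_keep(regex, string):
    """Split `string` on `regex`, keeping the matched delimiters so the
    original text can be rebuilt by joining the parts."""
    return re.split("(" + regex + ")", string)


text = "roast  chicken recipe"
parts = split_keep(r"\s+", text)
print(parts)
print("".join(parts) == text)
```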
-
pyss3.
set_verbosity
(level)¶ Set the verbosity level.
- 0 (quiet): do not output any message (only error messages)
- 1 (normal): default behavior, display only warning messages and progress bars
- 2 (verbose): also display informative non-essential messages
The following built-in constants can also be used to refer to these 3 values: VERBOSITY.QUIET, VERBOSITY.NORMAL, and VERBOSITY.VERBOSE, respectively.
For example, if you want PySS3 to hide everything, even progress bars, you could simply do:
>>> import pyss3
...
>>> pyss3.set_verbosity(0)
...
>>> # here's the rest of your code :D
or, equivalently:
>>> import pyss3
>>> from pyss3 import VERBOSITY
...
>>> pyss3.set_verbosity(VERBOSITY.QUIET)
...
>>> # here's the rest of your code :D
Parameters: level (int) – the verbosity level
-
pyss3.
sigmoid
(v, l)¶ A sigmoid function.
-
pyss3.
vdiv
(v0, v1)¶ Vectorial version of division.
-
pyss3.
vmax
(v0, v1)¶ Vectorial version of max.
-
pyss3.
vsum
(v0, v1)¶ Vectorial version of sum.
Submodules¶
pyss3.server module¶
SS3 classification server with visual explanations for live tests.
(Please, visit https://github.com/sergioburdisso/pyss3 for more info)
-
pyss3.server.
Live_Test
¶ alias of
pyss3.server.Server
-
class
pyss3.server.
Server
¶ Bases:
object
SS3’s Live Test HTTP server class.
-
static
get_port
()¶ Return the server port.
Returns: the server port Return type: int
-
run
(x_test=None, y_test=None, port=0, browser=True, quiet=True)¶ Wait for classification requests and serve them.
Parameters: - clf (pyss3.SS3) – the SS3 model to be attached to this server.
- x_test (list (of str)) – the list of documents to classify and visualize
- y_test (list (of str)) – the list of category labels
- port (int) – the port to listen on (default: random free port)
- browser (bool) – if True, it automatically opens up the live test on your browser
- quiet (bool) – if True, use quiet mode. Otherwise use verbose mode (default: True)
-
static
serve
(clf=None, x_test=None, y_test=None, port=0, browser=True, quiet=True)¶ Wait for classification requests and serve them.
Parameters: - clf (pyss3.SS3) – the SS3 model to be attached to this server.
- x_test (list (of str)) – the list of documents to classify and visualize
- y_test (list (of str)) – the list of category labels
- port (int) – the port to listen on (default: random free port)
- browser (bool) – if True, it automatically opens up the live test on your browser
- quiet (bool) – if True, use quiet mode. Otherwise use verbose mode (default: True)
-
static
set_model
(clf)¶ Attach a given SS3 model to this server.
Parameters: clf (pyss3.SS3) – an SS3 model
-
static
set_testset
(x_test, y_test)¶ Assign the test set to visualize.
Parameters: - x_test (list (of str)) – the list of documents to classify and visualize
- y_test (list (of str)) – the list of category labels
-
static
set_testset_from_files
(test_path, folder_label=True)¶ Load the test set files to visualize from test_path.
Parameters: - test_path (str) – the test set path
- folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)
Returns: True if category documents were found, False otherwise
Return type: bool
-
static
start_listening
(port=0)¶ Start listening on a port and return its number.
(If a port number is not given, it uses a random free port).
Parameters: port (int) – the port to listen on
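Asking the operating system for a random free port is usually done by binding to port 0. The sketch below shows that technique with the standard `socket` module; it is not the server's actual code, and `listen_on` and the localhost address are assumptions made for the example.

```python
import socket


def listen_on(port=0, host="127.0.0.1"):
    """Bind a TCP socket and start listening. With port=0 the operating
    system picks a free port, which is returned with the socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen(1)
    return sock, sock.getsockname()[1]


server_socket, chosen_port = listen_on()
print(chosen_port)
server_socket.close()
```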
-
pyss3.server.
content_type
(ext)¶ Given a file extension, return the content type.
-
pyss3.server.
get_http_body
(http_request)¶ Given an HTTP request, return the body.
-
pyss3.server.
get_http_contlength
(http_request)¶ Given an HTTP request, return the Content-Length value.
-
pyss3.server.
get_http_path
(http_request)¶ Given an HTTP request, return the resource path.
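What the three helpers above extract can be illustrated on a raw request string. The functions below are hypothetical stand-ins written for this example (they are not the module's implementation), and the `/classify` path and request content are made up.

```python
def http_path(request):
    """Return the resource path from the request line."""
    request_line = request.split("\r\n", 1)[0]
    return request_line.split(" ")[1]


def http_content_length(request):
    """Return the Content-Length header value (0 if it is absent)."""
    head = request.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            return int(value.strip())
    return 0


def http_body(request):
    """Return everything after the blank line that ends the headers."""
    return request.split("\r\n\r\n", 1)[1] if "\r\n\r\n" in request else ""


request = (
    "POST /classify HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "roast chicken"
)
print(http_path(request), http_content_length(request), http_body(request))
```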
-
pyss3.server.
main
()¶ The main function, called when the module is run from the command line.
-
pyss3.server.
parse_and_sanitize
(rsc_path)¶ Very simple function to parse and sanitize the given path.
pyss3.cmd_line module¶
This module lets you interact with your SS3 models through a Command Line.
(Please, visit https://github.com/sergioburdisso/pyss3 for more info)
-
exception
pyss3.cmd_line.
ArgsParseError
¶ Bases:
Exception
Exception thrown when an error occurs while parsing command arguments.
-
exception
pyss3.cmd_line.
GetTestDataError
¶ Bases:
Exception
Exception thrown when an error occurs while retrieving the test data.
-
class
pyss3.cmd_line.
SS3Prompt
(completekey='tab', stdin=None, stdout=None)¶ Bases:
cmd.Cmd
Prompt main class.
-
args_classify
(args)¶ Parse classify arguments.
-
args_evaluations
(args)¶ Parse evaluations arguments.
-
args_grid_search
(args)¶ Parse grid_search arguments.
-
args_k_fold
(args)¶ Parse k_fold arguments.
-
args_learn
(args)¶ Parse learn arguments.
-
args_live_test
(args)¶ Parse live_test arguments.
-
args_save
(args)¶ Parse save arguments.
-
args_set
(args)¶ Parse set arguments.
-
args_test
(args)¶ Parse test arguments.
-
args_train
(args)¶ Parse train arguments.
-
complete_evaluations
(text, line, begidx, endidx)¶ Complete arguments for ‘evaluations’ command.
-
complete_get
(text, line, begidx, endidx)¶ Complete arguments for ‘get’ command.
-
complete_grid_search
(text, line, begidx, endidx)¶ Complete arguments for ‘grid_search’ command.
-
complete_info
(text, line, begidx, endidx)¶ Complete arguments for ‘info’ command.
-
complete_k_fold
(text, line, begidx, endidx)¶ Complete arguments for ‘k_fold’ command.
-
complete_ld
(text, line, begidx, endidx)¶ Complete arguments for ‘load’ command.
-
complete_learn
(text, line, begidx, endidx)¶ Complete arguments for ‘learn’ command.
-
complete_live_test
(text, line, begidx, endidx)¶ Complete arguments for ‘live_test’ command.
-
complete_load
(text, line, begidx, endidx)¶ Complete arguments for ‘load’ command.
-
complete_plot
(text, line, begidx, endidx)¶ Complete arguments for ‘plot’ command.
-
complete_save
(text, line, begidx, endidx)¶ Complete arguments for ‘save’ command.
-
complete_set
(text, line, begidx, endidx)¶ Complete arguments for ‘set’ command.
-
complete_sv
(text, line, begidx, endidx)¶ Complete arguments for ‘save’ command.
-
complete_test
(text, line, begidx, endidx)¶ Complete arguments for ‘test’ command.
-
complete_train
(text, line, begidx, endidx)¶ Complete arguments for ‘train’ command.
-
default
(line)¶ Default error message.
-
do_EOF
(args='')¶ Quit the program.
-
do_classify
(**kwargs)¶ Classify a document.
- usage:
- classify [DOCUMENT_PATH]
- optional arguments:
- DOCUMENT_PATH the path to the document file
-
do_clone
(**kwargs)¶ Create a copy of the current model with a given name.
- usage:
- clone NEW_MODEL_NAME
- required arguments:
- NEW_MODEL_NAME the new model’s name
-
do_debug_term
(**kwargs)¶ Show debugging information about a given n-gram.
Namely, print the n-gram frequency (fr), local value (lv), global value (gv), confidence value (cv), sanction (sn) weight and significance (sg) weight.
- usage:
- debug_term N_GRAM
- required arguments:
- N_GRAM the n-gram (word, bigram, trigram, etc.) to debug
- examples:
- debug_term the
- debug_term potato
- debug_term "machine learning"
- debug_term "self driving car"
-
do_evaluations
(**kwargs)¶ Perform different actions linked to evaluations results.
- usage:
- evaluations OPTION [PATH] [METHOD] [DEF_CAT] [P VAL [P VAL …]]
- required arguments:
- OPTION indicates the action to perform
- values: {info,plot,save,remove} (default: info)
- info - show information about evaluations (including best values).
- plot - show an interactive 3-D plot with evaluation results in the web browser (it also saves it to disk).
- save - save the interactive 3-D plot to disk.
- remove - delete evaluation results from history
- optional arguments:
- PATH the dataset path used in the evaluation of interest
- METHOD the method that was used in the evaluation of interest
- values: {test,K-fold} where K is an integer > 1
- DEF_CAT default category used in the evaluation of interest
- values: {most-probable,unknown} or a category label
- P VAL the hyperparameter value (only for option “remove”)
- P values: {s,l,p,a}
- VAL values: float
- examples:
- show information about all evaluations:
evaluations info
- show information about evaluations in path “a/dataset/path”:
evaluations info a/dataset/path
- information about 3-fold evaluations in path “a/dataset/path”:
evaluations info a/dataset/path 3-fold
- information about test evaluations in path “a/dataset/path”:
evaluations info a/dataset/path test
- plot evaluations:
evaluations plot
- save evaluations:
evaluations save
- remove all evaluation result(s) in path “a/dataset/path”:
evaluations remove a/dataset/path
- remove 4-fold evaluation result(s) in path “a/dataset/path” with l = 1.1 and s = .45:
evaluations remove a/dataset/path 4-fold l 1.1 s .45
-
do_exit
(args='')¶ Quit the program.
-
do_get
(**kwargs)¶ Get a given hyperparameter value.
- usage:
- get PARAM
- required arguments:
- PARAM the hyperparameter name
- values: {s,l,p,a}
- examples:
- get s
get l
get p
get a
-
do_grid_search
(**kwargs)¶ Given a dataset, perform a grid search using the given hyperparameters values.
- usage:
- grid_search PATH [LABEL] [DEF_CAT] [METHOD] P EXP [P EXP …] [no-cache]
- required arguments:
- PATH the dataset path
- P EXP a list of values for a given hyperparameter.
- where:
- P is a hyperparameter name. values: {s,l,p,a}
- EXP is a Python expression returning a float or a list of floats. Note: if this expression contains whitespace, use quotation marks (e.g. “[0.5, 1.5]”)
- examples:
- s [.3,.4,.5]
s “[.3, .4, .5]” (note the whitespace and the quotation marks)
p r(.2,.8,6) (i.e. 6 points between .2 and .8)
- optional arguments:
- LABEL where to read category labels from.
- values:{file,folder} (default: folder)
- DEF_CAT default category to be assigned when the model is not able to actually classify a document.
- values: {most-probable,unknown} or a category label (default: most-probable)
- METHOD the method to be used
- values: {test, K-fold} (default: test)
- where: K-fold indicates the number of folds to be used; K is an integer > 1 (e.g. 4-fold, 10-fold, etc.)
- no-cache if present, disable the cache and recompute all the values
- examples:
- grid_search a/testset/path s r(.2,.8,6) l r(.1,2,6) -p r(.5,2,6) a [0,.01]
grid_search a/dataset/path 4-fold -s [.2,.3,.4,.5] -l [.5,1,1.5] -p r(.5,2,6)
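The value expressions above can be illustrated with a small, self-contained sketch. It assumes that r(start, stop, n) yields n evenly spaced points with both endpoints included, and that the grid is the Cartesian product of the four value lists; both are assumptions made for illustration, and the helper named grid is hypothetical, not part of PySS3.

```python
from itertools import product


def r(start, stop, n):
    """Return n evenly spaced floats from start to stop (assumed inclusive)."""
    if n < 2:
        return [float(start)]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def grid(ss, ll, pp, aa):
    """Enumerate every (s, l, p, a) combination a grid search would visit."""
    return list(product(ss, ll, pp, aa))


# Hypothetical value lists, written the way they would appear on the command line.
combos = grid(r(.2, .8, 6), [.5, 1, 1.5], r(.5, 2, 6), [0, .01])
print(len(combos))  # 6 * 3 * 6 * 2 = 216
```

The number of evaluations grows multiplicatively with each list, which is why caching previously computed results (the default unless no-cache is given) matters for larger grids.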
-
do_info
(**kwargs)¶ Show useful information.
- usage:
- info OPTION
- required arguments:
- OPTION indicates what information to show
- values: {all, parameters, categories, evaluations} (default: all)
- examples:
- info
info evaluations
-
do_k_fold
(**kwargs)¶ Perform a stratified k-fold validation using the given dataset.
- usage:
- k_fold PATH [LABEL] [DEF_CAT] [N-grams] [K-fold] [P VAL …] [no-cache]
- required arguments:
- PATH the dataset path
- optional arguments:
- LABEL where to read category labels from.
- values:{file,folder} (default: folder)
- DEF_CAT default category to be assigned when the model is not able to actually classify a document.
- values: {most-probable,unknown} or a category label (default: most-probable)
- N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams”, only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).
- value: {N-grams} with N integer > 0 (default: 1-grams)
- K-fold indicates the number of folds to be used.
- value: {K-fold} with K integer > 1 (default: 4-fold)
- P VAL sets a hyperparameter value (e.g. s 0.45)
- P values: {s,l,p,a}
- VAL values: float
- no-cache if present, disable the cache and recompute values
- examples:
- k_fold a/dataset/path 10-fold
k_fold a/dataset/path 4-fold -s .45 -l 1.1 -p 1
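As a rough illustration of what "stratified" means here, the sketch below splits document indices into K folds so that each fold keeps roughly the same category proportions as the whole dataset. The function stratified_folds and the toy labels are hypothetical; PySS3's own splitting (for instance, whether it shuffles) may differ.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each document index to one of k folds, category by category."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal the documents of each category round-robin across the folds.
    for indices in by_label.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


labels = ["a"] * 8 + ["b"] * 4  # toy dataset: 8 documents of "a", 4 of "b"
folds = stratified_folds(labels, 4)
for i, test_idx in enumerate(folds):
    # Each fold is used once as the test set; the others form the training set.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    print(i, len(train_idx), len(test_idx))
```

With this toy dataset every fold holds two "a" documents and one "b" document, mirroring the 2:1 ratio of the full set.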
-
do_learn
(**kwargs)¶ Learn a new document.
- usage:
- learn CAT [N-grams] [DOCUMENT_PATH]
- required arguments:
- CAT the category label
- optional arguments:
- N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams”, only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).
- value: {N-grams} with N integer > 0 (default: 1-grams)
- DOCUMENT_PATH the path to the document file
-
do_license
(args)¶ Print the license.
-
do_live_test
(**kwargs)¶ Interactively and graphically test the model.
- usage:
- live_test [TEST_PATH [LABEL]] [verbose]
- optional arguments:
- TEST_PATH the test set path
- LABEL where to read category labels from.
- values: {file,folder} (default: folder)
- verbose if present, run in verbose mode
- examples:
- live_test
live_test a/testset/path
live_test a/testset/path verbose
-
do_load
(**kwargs)¶ Load a local model (given its name).
- usage:
- load MODEL_NAME
- required arguments:
- MODEL_NAME the model’s name
-
do_new
(**kwargs)¶ Create a new empty SS3 model with a given name.
- usage:
- new MODEL_NAME
- required arguments:
- MODEL_NAME the model’s name
-
do_next_word
(**kwargs)¶ Show up to 3 possible words to follow after the given sentence.
- usage:
- next_word SENT
- required arguments:
- SENT a sentence
- examples:
- next_word “the self driving”
next_word “a machine learning”
-
do_plot
(**kwargs)¶ Plot word value distribution curve or the evaluation results.
- usage:
- plot OPTION
- required arguments:
- OPTION indicates what to plot
- values:
- evaluations;
- distribution CAT;
- where:
- CAT the category label
- examples:
- plot distribution a_category
plot evaluations
-
do_rename
(**kwargs)¶ Rename the current model with a given name.
- usage:
- rename NEW_MODEL_NAME
- required arguments:
- NEW_MODEL_NAME the model’s new name
-
do_save
(**kwargs)¶ Save to disk the model, learned vocabulary, evaluation results, etc.
- usage:
- save OPTION
- required arguments:
- OPTION indicates what to save to disk
- values:
- model; (default)
- evaluations;
- vocabulary [CAT];
- stopwords [SG_THRESHOLD];
- where:
- CAT the category label
- SG_THRESHOLD significance (sg) value used as a threshold to consider words as stopwords (i.e. words with sg < sg_threshold for all categories will be considered as “stopwords”) (default: .01)
- examples:
- save
save model
save vocabulary
save vocabulary a_category
save stopwords
save stopwords .1
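The stopword rule described for SG_THRESHOLD can be sketched as follows. The function and the sg values are made up for illustration; only the rule itself (sg below the threshold in every category) comes from the description above.

```python
def stopwords(sg, threshold=0.01):
    """Return the words whose sg is below threshold in every category."""
    return sorted(
        word for word, per_cat in sg.items()
        if all(value < threshold for value in per_cat.values())
    )


# Hypothetical significance (sg) values per word and category.
sg_values = {
    "the": {"sports": 0.001, "tech": 0.002},
    "goal": {"sports": 0.9, "tech": 0.003},
    "python": {"sports": 0.0, "tech": 0.8},
}
print(stopwords(sg_values))        # ['the']
print(stopwords(sg_values, 0.95))  # ['goal', 'python', 'the']
```

A word that is significant for even one category is kept, so raising the threshold only ever enlarges the stopword list.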
-
do_set
(**kwargs)¶ Set a given hyperparameter value.
- usage:
- set P VAL [P VAL …]
- required arguments:
- P VAL sets a hyperparameter value
- examples: s .45; s .5;
- P values: {s,l,p,a}
- VAL values: float
- examples:
- set s .5
set l 0.5
set p 2
set s .5 l 0.5 p 2
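A minimal sketch of how a "P VAL [P VAL …]" argument list could be turned into a dictionary is shown below. The function parse_hparams is hypothetical and is not the module's parse_hparams_args; tolerating a leading dash (as in "-s .45", which appears in some examples in this documentation) is likewise an assumption.

```python
def parse_hparams(args):
    """Parse ['s', '.5', 'l', '0.5'] into {'s': 0.5, 'l': 0.5}."""
    if len(args) % 2:
        raise ValueError("expected P VAL pairs")
    hparams = {}
    for name, value in zip(args[::2], args[1::2]):
        name = name.lstrip("-")  # accept both "s" and "-s"
        if name not in ("s", "l", "p", "a"):
            raise ValueError("unknown hyperparameter: " + name)
        hparams[name] = float(value)
    return hparams


print(parse_hparams("s .5 l 0.5 p 2".split()))
# {'s': 0.5, 'l': 0.5, 'p': 2.0}
```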
-
do_test
(**kwargs)¶ Test the model using the given test set.
- usage:
- test TEST_PATH [LABEL] [DEF_CAT] [P VAL …] [no-cache]
- required arguments:
- TEST_PATH the test set path
- optional arguments:
- LABEL where to read category labels from.
- values:{file,folder} (default: folder)
- DEF_CAT default category to be assigned when the model is not able to actually classify a document.
- values: {most-probable,unknown} or a category label (default: most-probable)
- P VAL sets a hyperparameter value
- examples: s .45; s .5;
- P values: {s,l,p,a}
- VAL values: float
- no-cache if present, disable the cache and recompute values
- examples:
- test a/testset/path
test a/testset/path -s .45 -l 1.1 -p 1
test a/testset/path unknown -s .45 -l 1.1 -p 1 no-cache
-
do_train
(**kwargs)¶ Train the model using a training set and then save it.
- usage:
- train TRAIN_PATH [LABEL] [N-grams]
- required arguments:
- TRAIN_PATH the training set path
- optional arguments:
- LABEL where to read category labels from.
- values:{file,folder} (default: folder)
- N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams”, only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).
- value: {N-grams} with N integer > 0 (default: 1-grams)
- examples:
- train a/training/set/path 3-grams
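To make the meaning of the N-grams argument concrete, the sketch below lists every n-gram up to a maximum length for a short token list. Both helper functions are hypothetical illustrations and say nothing about how PySS3 tokenizes or stores n-grams internally.

```python
def ngrams_up_to(tokens, max_n):
    """Return every n-gram of the token list for n = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def parse_ngrams_arg(arg):
    """Turn an argument such as '3-grams' into the integer 3."""
    n = int(arg.split("-")[0])
    if n < 1:
        raise ValueError("N must be an integer > 0")
    return n


print(ngrams_up_to(["self", "driving", "car"], parse_ngrams_arg("2-grams")))
# ['self', 'driving', 'car', 'self driving', 'driving car']
```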
-
do_update
(**kwargs)¶ Update model values (cv, gv, lv, etc.).
-
precmd
(line)¶ Hook method executed just before the command.
-
preloop
()¶ Hook method executed once when cmdloop() is called.
-
-
pyss3.cmd_line.
delete_results
(data_path, method, def_cat, hparams, only_count=False)¶ Remove evaluations from history.
-
pyss3.cmd_line.
delete_results_slpa
(rh_metric, hparams, only_count=False, best=True)¶ Remove evaluations from history given hyperparameters s, l, p, a.
-
pyss3.cmd_line.
evaluations_info
(data_path=None, method=None)¶ Print evaluations best values.
-
pyss3.cmd_line.
evaluations_remove
(data_path, method, def_cat, hparams)¶ Evaluation remove command handler.
-
pyss3.cmd_line.
get_global_best
(values)¶ Given a list of evaluations values, return the best one.
-
pyss3.cmd_line.
get_results_history
(path, method, def_cat)¶ Given a path, a method and a default category return results history.
-
pyss3.cmd_line.
get_test_data_cache
(path, def_cat, method, s, l, p, a)¶ Return test results from cache.
-
pyss3.cmd_line.
grid_search
(data_path, folder_label, def_cat, n_gram, k_fold, ss, ll, pp, aa, cache=True)¶ Perform a grid search using values from ss, ll, pp, aa.
-
pyss3.cmd_line.
grid_search_loop
(data_path, x_test, y_test, categories, def_cat, k_fold, i_fold, ss, ll, pp, aa, cache=True, leave_pbar=True)¶ Grid search main loop.
-
pyss3.cmd_line.
intersect
(l0, l1)¶ Given two lists return the intersection.
-
pyss3.cmd_line.
is_in_cache
(path, method, def_cat, s, l, p, a)¶ Return whether this evaluation is already computed.
-
pyss3.cmd_line.
json2rh
(dct)¶ Convert a given dictionary to a RecursiveDefaultDict.
-
pyss3.cmd_line.
k_fold2method
(k_fold)¶ Convert the k number to a proper method string.
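The command documentation describes METHOD as either “test” or “K-fold” with K an integer > 1. Under the assumption that a fold count of 1 stands for a plain test run, the conversion could look like the sketch below; the function names and the exact rule are illustrative guesses, not the module's code.

```python
def k_fold_to_method(k_fold):
    """Map a fold count to a method string (assumed rule: 1 means 'test')."""
    return "test" if k_fold <= 1 else "%d-fold" % k_fold


def method_to_k_fold(method):
    """Inverse mapping: 'test' -> 1, '4-fold' -> 4."""
    return 1 if method == "test" else int(method.split("-")[0])


print(k_fold_to_method(1))         # test
print(k_fold_to_method(4))         # 4-fold
print(method_to_k_fold("10-fold"))  # 10
```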
-
pyss3.cmd_line.
k_fold_classification_report
(data_path, method, def_cat, s, l, p, a)¶ Create the classification report for k-fold validations.
-
pyss3.cmd_line.
k_fold_validation
(data_path, folder_label, def_cat, n_grams, k_fold, s, l, p, a, cache=True)¶ Perform a stratified k-fold cross validation using the given data.
-
pyss3.cmd_line.
load_data
(data_path, folder_label, def_cat=None, return_cat_index=True, cmd_name='test')¶ Load documents from disk, return the x_data, y_data and categories.
-
pyss3.cmd_line.
load_results_history
()¶ Load results history (evaluations) from disk.
-
pyss3.cmd_line.
main
()¶ Main function.
-
pyss3.cmd_line.
module_path
(file_path)¶ Convert a file path relative to this module path.
-
pyss3.cmd_line.
parse_hparams_args
(op_args, defaults=True)¶ Parse hyperparameters arguments list.
-
pyss3.cmd_line.
plot_confusion_matrices
(cms, classes, info='', max_colums=3)¶ Show and plot the confusion matrices.
-
pyss3.cmd_line.
re_in
(regex, l)¶ Given a list of strings, return the first match in the list.
-
pyss3.cmd_line.
requires_args
(func)¶ A @decorator.
-
pyss3.cmd_line.
requires_model
(func)¶ A @decorator.
-
pyss3.cmd_line.
results
(y_true, y_pred, categories, def_cat, cache=True, method='test', data_path='', folder=False, plots=True, k_fold=1, i_fold=0)¶ Compute evaluation results and save them to disk.
-
pyss3.cmd_line.
round_fix
(v)¶ Round the number v (used to keep the results history file small).
-
pyss3.cmd_line.
save_html_evaluations
(show_plot=True)¶ Save results history (evaluations) to disk (interactive html file).
-
pyss3.cmd_line.
save_results
(rh, categories, accuracy, report, conf_matrix, k_fold, i_fold, s, l, p, a)¶ Save evaluation results to disk.
-
pyss3.cmd_line.
save_results_history
()¶ Save results history (evaluations) to disk.
-
pyss3.cmd_line.
split_args
(args)¶ Parse and split arguments.
-
pyss3.cmd_line.
subtract
(l0, l1)¶ Subtract list l1 from l0.
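The two list helpers in this module (intersect and subtract) can be pictured with the sketch below. It assumes that the order of l0 is preserved and that duplicates in l0 are kept; the source does not state either detail, and the function names are deliberately different from the library's.

```python
def list_intersect(l0, l1):
    """Return the elements of l0 that also appear in l1, in l0's order."""
    return [x for x in l0 if x in l1]


def list_subtract(l0, l1):
    """Return the elements of l0 that do not appear in l1, in l0's order."""
    return [x for x in l0 if x not in l1]


args = ["a/dataset/path", "4-fold", "no-cache"]
known_flags = ["no-cache", "verbose"]
print(list_intersect(args, known_flags))  # ['no-cache']
print(list_subtract(args, known_flags))   # ['a/dataset/path', '4-fold']
```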
-
pyss3.cmd_line.
test
(test_path, folder_label, def_cat, s, l, p, a, cache)¶ Test the model with a given test set.
-
pyss3.cmd_line.
train
(x_train, y_train, n_grams, train_path='', folder_label=None, save=True, leave_pbar=True)¶ Train a new model with the given training set.
pyss3.util module¶
This is a helper module with utility classes and functions.
-
class
pyss3.util.
Dataset
¶ Bases:
object
A helper class with methods to read/write datasets.
-
static
load_from_files
(data_path, folder_label=True, as_single_doc=False)¶ Load category documents from disk.
Parameters: - data_path (str) – the training or the test set path
- folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)
- as_single_doc – read the documents as a single (and big) document (default: False)
Returns: the (x_train, y_train) or the (x_test, y_test) pairs.
Return type: tuple
-
class
pyss3.util.
Preproc
¶ Bases:
object
A helper class with methods to preprocess input documents.
-
static
clean_and_ready
(text, dots=True, normalize=True, min_len=1)¶ Clean and prepare the text.
-
class
pyss3.util.
Print
¶ Bases:
object
Helper class to handle print functionalities.
-
static
error
(msg, raises=None, offset=0, decorator=True)¶ Print an error.
Parameters: - msg (str) – the message to show
- raises (Exception) – the exception to be raised after showing the message
- offset (int) – shift the message to the right (offset characters)
- decorator (bool) – if True, use error message decorator
-
static
info
(msg, newln=True, offset=0, decorator=True, force_show=False)¶ Print an info message.
Parameters: - msg (str) – the message to show
- newln (bool) – use new line after the message (default: True)
- offset (int) – shift the message to the right (offset characters)
- decorator (bool) – if True, use info message decorator
- force_show (bool) – if True, show message even when not in verbose mode
-
static
is_quiet
()¶ Check if the current verbosity level is quiet.
-
static
is_verbose
()¶ Check if the current verbosity level is verbose.
-
static
set_decorator_error
(start, end=None)¶ Set error messages decorator.
Parameters: - start (str) – messages prefix
- end (str) – messages suffix
-
static
set_decorator_info
(start, end=None)¶ Set info messages decorator.
Parameters: - start (str) – messages prefix
- end (str) – messages suffix
-
static
set_decorator_warn
(start, end=None)¶ Set warning messages decorator.
Parameters: - start (str) – messages prefix
- end (str) – messages suffix
-
static
set_verbosity
(level)¶ Set the verbosity level.
- 0 (quiet): do not output any message (only error messages)
- 1 (normal): default behavior, display only warning messages and progress bars
- 2 (verbose): display also the informative non-essential messages
The following built-in constants can also be used to refer to these 3 values: VERBOSITY.QUIET, VERBOSITY.NORMAL, and VERBOSITY.VERBOSE, respectively.
For example, if you want PySS3 to hide everything, even progress bars, you could do:
>>> from pyss3.util import Print, VERBOSITY
...
>>> Print.set_verbosity(VERBOSITY.QUIET)  # or, equivalently, Print.set_verbosity(0)
...
>>> # here's the rest of your code :D
Parameters: level (int) – the verbosity level
-
static
show
(msg='', newln=True, offset=0)¶ Print a message.
Parameters: - msg (str) – the message to show
- newln (bool) – use new line after the message (default: True)
- offset (int) – shift the message to the right (offset characters)
-
static
verbosity_region_begin
(level)¶ Indicate that a region with different verbosity begins.
When the region ends by calling verbosity_region_end, the previous verbosity will be restored.
Example:
>>> from pyss3.util import Print, VERBOSITY
...
>>> Print.verbosity_region_begin(VERBOSITY.QUIET)
>>> # inside this region (from now on), verbosity will be 'quiet'
...
>>> Print.verbosity_region_end()
>>> # the verbosity level is restored to what it was before entering the region
Parameters: level (int) – the verbosity level for this region (see set_verbosity documentation for valid values)
-
static
verbosity_region_end
()¶ Indicate that a region with different verbosity ends.
The verbosity will be restored to the value it had before beginning this region with verbosity_region_begin.
Example:
>>> from pyss3.util import Print, VERBOSITY
...
>>> Print.verbosity_region_begin(VERBOSITY.VERBOSE)
>>> # inside this region (from now on), verbosity will be 'verbose'
...
>>> Print.verbosity_region_end()
>>> # the verbosity level is restored to what it was before entering the region
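The save-and-restore behavior of verbosity regions can be modeled with a small stand-in class. This is only a sketch of the documented semantics: the Verbosity class is hypothetical, and supporting nested regions through a stack is an assumption rather than something the documentation states.

```python
QUIET, NORMAL, VERBOSE = 0, 1, 2  # the three documented verbosity levels


class Verbosity:
    """Stand-in that tracks a verbosity level plus the levels saved by regions."""

    def __init__(self, level=NORMAL):
        self.level = level
        self._saved = []

    def region_begin(self, level):
        # Remember the current level so region_end can restore it.
        self._saved.append(self.level)
        self.level = level

    def region_end(self):
        # Restore the level that was active before the region began.
        if self._saved:
            self.level = self._saved.pop()


v = Verbosity()
v.region_begin(QUIET)
print(v.level)  # 0
v.region_end()
print(v.level)  # 1
```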
-
static
warn
(msg, newln=True, raises=None, offset=0, decorator=True)¶ Print a warning message.
Parameters: - msg (str) – the message to show
- newln (bool) – use new line after the message (default: True)
- raises (Exception) – the exception to be raised after showing the message
- offset (int) – shift the message to the right (offset characters)
- decorator (bool) – if True, use warning message decorator
-
class
pyss3.util.
RecursiveDefaultDict
¶ Bases:
dict
A dict whose default value is a dict.
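One common way to obtain a dict whose default value is another such dict is to override `__missing__`, as in the sketch below. The class name NestedDict, the keys, and the stored number are all hypothetical; this is not necessarily how RecursiveDefaultDict is implemented.

```python
class NestedDict(dict):
    """A dict that creates a nested NestedDict for any missing key."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        value = self[key] = type(self)()
        return value


history = NestedDict()
# Intermediate levels are created on demand; the stored value is made up.
history["a/dataset/path"]["4-fold"]["most-probable"]["accuracy"] = 0.93
print(history["a/dataset/path"]["4-fold"]["most-probable"]["accuracy"])  # 0.93
```

Such a structure suits a results history keyed by path, method, and default category, because deeply nested entries can be written without creating each level first.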
-
class
pyss3.util.
Style
¶ Bases:
object
Helper class to handle print styles.
-
static
blue
(text)¶ Apply ‘blue’ style to text.
-
static
bold
(text)¶ Apply bold style to text.
-
static
fail
(text)¶ Apply the ‘fail’ style to text.
-
static
green
(text)¶ Apply ‘green’ style to text.
-
static
header
(text)¶ Apply ‘header’ style to text.
-
static
ubold
(text)¶ Apply underline and bold style to text.
-
static
underline
(text)¶ Underline text.
-
static
warning
(text)¶ Apply the ‘warning’ style to text.
-
static