PySS3 Package

Main module

This is the main module containing the implementation of the SS3 classifier.

(Please, visit https://github.com/sergioburdisso/pyss3 for more info)

exception pyss3.EmptyModelError(msg='')

Bases: Exception

Exception to be thrown when the model is empty.

exception pyss3.InvalidCategoryError(msg='')

Bases: Exception

Exception to be thrown when a category is not valid.

class pyss3.SS3(s=None, l=None, p=None, a=None, name='', cv_m='norm_gv_xai', sn_m='xai')

Bases: object

The SS3 classifier class.

The SS3 classifier was originally defined in Section 3 of https://dx.doi.org/10.1016/j.eswa.2019.05.023 (preprint available here: https://arxiv.org/abs/1905.08772)

Parameters
  • s (float) – the “smoothness” (sigma) hyperparameter value

  • l (float) – the “significance” (lambda) hyperparameter value

  • p (float) – the “sanction” (rho) hyperparameter value

  • a (float) – the alpha hyperparameter value (i.e. all terms with a confidence value (cv) less than alpha will be ignored during classification)

  • name (str) – the model’s name (to save and load the model from disk)

  • cv_m (str) – method used to compute the confidence value (cv) of each term (word or n-grams), options are: “norm_gv_xai”, “norm_gv” and “gv” (default: “norm_gv_xai”)

  • sn_m (str) – method used to compute the sanction (sn) function, options are: “vanilla” and “xai” (default: “xai”)
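
The effect of alpha can be pictured with a small sketch. This is not PySS3's code: the dictionary of per-term confidence values and the name `drop_low_confidence_terms` are assumptions made purely for illustration.

```python
def drop_low_confidence_terms(term_cvs, alpha):
    """Keep only the terms whose confidence value (cv) is at least alpha."""
    return {term: cv for term, cv in term_cvs.items() if cv >= alpha}


# Hypothetical confidence values of three terms for a single category.
term_cvs = {"the": 0.0, "match": 0.4, "goalkeeper": 0.9}
kept = drop_low_confidence_terms(term_cvs, alpha=0.5)
print(kept)  # {'goalkeeper': 0.9}
```

With alpha set to 0, no term is dropped.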

classify(doc, prep=True, sort=True, json=False)

Classify a given document.

Parameters
  • doc (str) – the content of the document

  • prep (bool) – enables input preprocessing (default: True)

  • sort (bool) – sort the classification result (from best to worst)

  • json (bool) – return the result in JSON format

Returns

the document confidence vector if sort is False. If sort is True, a list of pairs (category index, confidence value) ordered by cv.

Return type

list
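
The relation between the two return shapes can be sketched as follows. The helper name `rank_categories` is hypothetical, and the sketch only mirrors the description above (pairs of category index and confidence value, best first); it is not taken from the library.

```python
def rank_categories(confidence_vector):
    """Return (category index, confidence value) pairs, best first."""
    return sorted(
        enumerate(confidence_vector),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical confidence vector for three categories.
cv = [0.2, 1.5, 0.7]
ranked = rank_categories(cv)
print(ranked)  # [(1, 1.5), (2, 0.7), (0, 0.2)]
```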

fit(x_train, y_train, n_grams=1, prep=True, leave_pbar=True)

Train the model given a list of documents and category labels.

Parameters
  • x_train (list (of str)) – the list of documents

  • y_train (list (of str)) – the list of document labels

  • n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words); 2 means 1-grams and 2-grams; 3 means 1-grams, 2-grams and 3-grams; and so on).

  • prep (bool) – enables input preprocessing (default: True)

  • leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.
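
What the n_grams value implies can be sketched in plain Python. The helper below is illustrative only: `terms_up_to` is a hypothetical name and whitespace tokenization is an assumption, so it should not be read as how PySS3 tokenizes or stores terms.

```python
def terms_up_to(words, max_n):
    """List every n-gram of the word sequence, for n = 1 .. max_n."""
    terms = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            terms.append(" ".join(words[start:start + n]))
    return terms


words = "self driving car".split()
print(terms_up_to(words, 1))  # ['self', 'driving', 'car']
print(terms_up_to(words, 2))  # ['self', 'driving', 'car', 'self driving', 'driving car']
```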

get_a()

Get the alpha hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_alpha()

Get the alpha hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_categories()

Get the list of category names.

Returns

the list of category names

Return type

list (of str)

get_category_index(name)

Given its name, return the category index.

Parameters

name (str) – The category name

Returns

the category index

Return type

int

Raises

InvalidCategoryError

get_category_name(index)

Given its index, return the category name.

Parameters

index (int) – The category index

Returns

the category name

Return type

str

Raises

InvalidCategoryError

get_hyperparameters()

Get hyperparameter values.

Returns

a tuple with the current hyperparameter values (s, l, p, a)

Return type

tuple

get_l()

Get the “significance” (lambda) hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_most_probable_category()

Get the name of the most probable category.

Returns

the name of the most probable category

Return type

str

get_name()

Return the model’s name.

Returns

the model’s name.

Return type

str

get_next_words(sent, cat, n=None)

Given a sentence, return the list of n (possible) following words.

Parameters
  • sent (str) – a sentence (e.g. “an artificial”)

  • cat (str) – the category name

  • n (int) – the maximum number of possible answers

Returns

a list of tuples (word, frequency, probability)

Return type

list (of tuple)
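
The shape of the returned tuples can be illustrated with a sketch. It assumes that the probability is the word's frequency divided by the total frequency of all candidate words; that assumption, the function name `rank_next_words` and the sample counts are not taken from the library.

```python
def rank_next_words(follow_counts, n=None):
    """Return up to n (word, frequency, probability) tuples, most frequent first."""
    total = sum(follow_counts.values())
    ranked = sorted(follow_counts.items(), key=lambda item: item[1], reverse=True)
    return [(word, count, count / total) for word, count in ranked[:n]]


# Hypothetical counts of the words seen right after "an artificial".
follow_counts = {"intelligence": 3, "neural": 1}
print(rank_next_words(follow_counts))
# [('intelligence', 3, 0.75), ('neural', 1, 0.25)]
```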

get_p()

Get the “sanction” (rho) hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_s()

Get the “smoothness” (sigma) hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_sanction()

Get the “sanction” (rho) hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_significance()

Get the “significance” (lambda) hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_smoothness()

Get the “smoothness” (sigma) hyperparameter value.

Returns

the hyperparameter value

Return type

float

get_stopwords(sg_threshold=0.01)

Get the list of (recognized) stopwords.

Parameters

sg_threshold (float) – significance (sg) value used as a threshold to consider words as stopwords (i.e. words with sg < sg_threshold for all categories will be considered as “stopwords”)

Returns

a list of stopwords

Return type

list (of str)
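
The threshold rule described above (sg below sg_threshold for all categories) can be sketched like this. The data layout (one list of sg values per word) and the name `stopwords_below` are assumptions for illustration only.

```python
def stopwords_below(sg_by_word, sg_threshold=0.01):
    """Return the words whose sg is below the threshold in every category."""
    return [
        word
        for word, sg_values in sg_by_word.items()
        if all(sg < sg_threshold for sg in sg_values)
    ]


# Hypothetical significance values, one per category.
sg_by_word = {"the": [0.001, 0.002], "goal": [0.8, 0.003], "of": [0.0, 0.0]}
print(stopwords_below(sg_by_word))  # ['the', 'of']
```

A word such as "goal" is kept out of the list because it is significant for at least one category.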

get_word(index)

Given the index, return the word.

Parameters

index (int) – the word index

Returns

the word

Return type

str

get_word_index(word)

Given a word, return its index.

Parameters

word (str) – a word

Returns

the word index

Return type

int

learn(doc, cat, n_grams=1, prep=True, update=True)

Learn a new document for a given category.

Parameters
  • doc (str) – the content of the document

  • cat (str) – the category name

  • n_grams (int) – indicates the maximum n-grams to be learned (e.g. a value of 1 means only 1-grams (words); 2 means 1-grams and 2-grams; 3 means 1-grams, 2-grams and 3-grams; and so on).

  • prep (bool) – enables input preprocessing (default: True)

  • update (bool) – enables model auto-update after learning (default: True)

load_model()

Load model from disk.

Raises

IOError

plot_value_distribution(cat)

Plot the category’s global and local value distribution.

Parameters

cat (str) – the category name

predict(x_test, def_cat='most-probable', labels=True, prep=True, leave_pbar=True)

Classify a list of documents.

Parameters
  • x_test (list (of str)) – the list of documents to be classified

  • def_cat (str) – default category to be assigned when SS3 is not able to classify a document. Options are “most-probable”, “unknown” or a given category name.

  • labels (bool) – whether to return the list of category names or just category indexes

  • prep (bool) – enables input preprocessing (default: True)

  • leave_pbar (bool) – controls whether to leave the progress bar or remove it after finishing.

Returns

if labels is True, the list of category names, otherwise, the list of category indexes.

Return type

list (of int or str)

Raises

EmptyModelError
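
How def_cat comes into play can be sketched as follows. The sketch assumes that "not able to classify" means every confidence value is zero; that condition, the function name `label_or_default` and the `most_probable` argument are assumptions, not the library's documented behavior.

```python
def label_or_default(cv, categories, def_cat="most-probable", most_probable=None):
    """Pick a label from a confidence vector, falling back to def_cat."""
    if any(value > 0 for value in cv):
        best_index = max(range(len(cv)), key=lambda i: cv[i])
        return categories[best_index]
    if def_cat == "most-probable":
        return most_probable
    return def_cat  # "unknown" or an explicit category name


categories = ["sports", "tech"]
print(label_or_default([0.1, 0.7], categories))  # tech
print(label_or_default([0.0, 0.0], categories, most_probable="sports"))  # sports
print(label_or_default([0.0, 0.0], categories, def_cat="unknown"))  # unknown
```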

predict_proba(x_test, prep=True, leave_pbar=True)

Classify a list of documents returning a list of confidence vectors.

Parameters
  • x_test (list (of str)) – the list of documents to be classified

  • prep (bool) – enables input preprocessing (default: True)

  • leave_pbar (bool) – controls whether to leave the progress bar after finishing or remove it.

Returns

the list of confidence vectors

Return type

list (of list of float)

Raises

EmptyModelError

print_categories_info()

Print information about learned categories.

print_hyperparameters_info()

Print information about hyperparameters.

print_model_info()

Print information regarding the model.

print_ngram_info(ngram)

Print debugging information about a given n-gram.

Namely, print the n-gram frequency (fr), local value (lv), global value (gv), confidence value (cv), sanction (sn) weight, significance (sg) weight.

Parameters

ngram (str) – the n-gram (e.g. “machine”, “machine learning”, etc.)

save_cat_vocab(cat, path='./', n_grams=-1)

Save category vocabulary to disk.

Parameters
  • cat (str) – the category name

  • path (str) – the path in which to store the vocabulary

  • n_grams (int) – indicates the n-grams to be stored (e.g. only 1-grams, 2-grams, 3-grams, etc.). Default -1 stores all learned n-grams (1-grams, 2-grams, 3-grams, etc.)

save_model()

Save the model to disk.

save_vocab(path='./', n_grams=-1)

Save learned vocabularies to disk.

Parameters
  • path (str) – the path in which to store the vocabularies

  • n_grams (int) – indicates the n-grams to be stored (e.g. only 1-grams, 2-grams, 3-grams, etc.). Default -1 stores all learned n-grams (1-grams, 2-grams, 3-grams, etc.)

set_a(value)

Set the alpha hyperparameter value.

All terms with a confidence value (cv) less than alpha will be ignored during classification.

Parameters

value (float) – the hyperparameter value

set_alpha(value)

Set the alpha hyperparameter value.

All terms with a confidence value (cv) less than alpha will be ignored during classification.

Parameters

value (float) – the hyperparameter value

set_hyperparameters(s=None, l=None, p=None, a=None)

Set hyperparameter values.

Parameters
  • s (float) – the “smoothness” (sigma) hyperparameter

  • l (float) – the “significance” (lambda) hyperparameter

  • p (float) – the “sanction” (rho) hyperparameter

  • a (float) – the alpha hyperparameter (i.e. all terms with a confidence value (cv) less than alpha will be ignored during classification)

set_l(value)

Set the “significance” (lambda) hyperparameter value.

Parameters

value (float) – the hyperparameter value

set_p(value)

Set the “sanction” (rho) hyperparameter value.

Parameters

value (float) – the hyperparameter value

set_s(value)

Set the “smoothness” (sigma) hyperparameter value.

Parameters

value (float) – the hyperparameter value

set_sanction(value)

Set the “sanction” (rho) hyperparameter value.

Parameters

value (float) – the hyperparameter value

set_significance(value)

Set the “significance” (lambda) hyperparameter value.

Parameters

value (float) – the hyperparameter value

set_smoothness(value)

Set the “smoothness” (sigma) hyperparameter value.

Parameters

value (float) – the hyperparameter value

summary_op_ngrams(cvs)

Summary operator for n-gram confidence vectors.

By default it returns the addition of all confidence vectors. However, in case you want to use a custom summary operator, this function must be replaced as shown in the following example:

>>> from pyss3 import SS3
>>> def my_summary_op(cvs):
...     return cvs[0]  # keep only the first n-gram's confidence vector
...
>>> clf = SS3()
>>> # ...
>>> clf.summary_op_ngrams = my_summary_op

Note that any function that receives a list of vectors and returns a single vector can be used. In the example above, the summary operator is replaced by the user-defined my_summary_op, which ignores every confidence vector except that of the first n-gram (this is purely illustrative and has no practical use).

Parameters

cvs (list (of list of float)) – a list of n-gram confidence vectors

Returns

a sentence confidence vector

Return type

list (of float)
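
A summary operator of the required shape (list of vectors in, single vector out) can be sketched in plain Python. Both functions below are illustrative stand-ins with hypothetical names: the first mimics the default behavior described above (addition of all confidence vectors), the second is one possible custom alternative.

```python
def sum_summary_op(cvs):
    """Element-wise addition of all confidence vectors."""
    return [sum(column) for column in zip(*cvs)]


def max_summary_op(cvs):
    """Element-wise maximum of all confidence vectors."""
    return [max(column) for column in zip(*cvs)]


# Two hypothetical confidence vectors over two categories.
cvs = [[1.0, 0.5], [0.25, 2.0]]
print(sum_summary_op(cvs))  # [1.25, 2.5]
print(max_summary_op(cvs))  # [1.0, 2.0]
```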

summary_op_paragraphs(cvs)

Summary operator for paragraph confidence vectors.

By default it returns the addition of all confidence vectors. However, in case you want to use a custom summary operator, this function must be replaced as shown in the following example:

>>> from pyss3 import SS3
>>> def dummy_summary_op(cvs):
...     return cvs[0]  # keep only the first paragraph's confidence vector
...
>>> clf = SS3()
>>> # ...
>>> clf.summary_op_paragraphs = dummy_summary_op

Note that any function that receives a list of vectors and returns a single vector can be used. In the example above, the summary operator is replaced by the user-defined dummy_summary_op, which ignores every confidence vector except that of the first paragraph (this is purely illustrative and has no practical use).

Parameters

cvs (list (of list of float)) – a list of paragraph confidence vectors

Returns

the document confidence vector

Return type

list (of float)

summary_op_sentences(cvs)

Summary operator for sentence confidence vectors.

By default it returns the addition of all confidence vectors. However, in case you want to use a custom summary operator, this function must be replaced as shown in the following example:

>>> from pyss3 import SS3
>>> def dummy_summary_op(cvs):
...     return cvs[0]  # keep only the first sentence's confidence vector
...
>>> clf = SS3()
>>> # ...
>>> clf.summary_op_sentences = dummy_summary_op

Note that any function that receives a list of vectors and returns a single vector can be used. In the example above, the summary operator is replaced by the user-defined dummy_summary_op, which ignores every confidence vector except that of the first sentence (this is purely illustrative and has no practical use).

Parameters

cvs (list (of list of float)) – a list of sentence confidence vectors

Returns

a paragraph confidence vector

Return type

list (of float)
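
Taken together, the three summary operators form a chain: n-gram vectors are reduced to a sentence vector, sentence vectors to a paragraph vector, and paragraph vectors to the document vector. The sketch below illustrates that chain under the assumption that a document is given as nested lists (paragraphs, then sentences, then n-gram confidence vectors); the function names and the input layout are hypothetical.

```python
def add_vectors(cvs):
    """Element-wise addition of a list of vectors."""
    return [sum(column) for column in zip(*cvs)]


def document_cv(paragraphs, op_ngrams=add_vectors,
                op_sentences=add_vectors, op_paragraphs=add_vectors):
    """Reduce n-gram vectors to one document confidence vector."""
    paragraph_cvs = []
    for sentences in paragraphs:
        # n-gram vectors -> one vector per sentence
        sentence_cvs = [op_ngrams(ngram_cvs) for ngram_cvs in sentences]
        # sentence vectors -> one vector per paragraph
        paragraph_cvs.append(op_sentences(sentence_cvs))
    # paragraph vectors -> the document vector
    return op_paragraphs(paragraph_cvs)


# Two paragraphs: the first has two sentences, the second has one.
doc = [
    [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 2.0]]],
    [[[0.25, 0.25]]],
]
print(document_cv(doc))  # [1.75, 2.75]
```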

update_values(force=False)

Update model values (cv, gv, lv, etc.).

Parameters

force (bool) – force update (even if hyperparameters haven’t changed)

pyss3.key_as_int(dct)

Cast the given dictionary (numerical) keys to int.

pyss3.mad(values, n)

Median absolute deviation mean.

pyss3.sigmoid(v, l)

A sigmoid function.

pyss3.vdiv(v0, v1)

Vectorial version of division.

pyss3.vmax(v0, v1)

Vectorial version of max.

pyss3.vsum(v0, v1)

Vectorial version of sum.
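
As a rough illustration of what "vectorial" means for the three helpers above, the sketch below applies each operation element by element to two lists. The function names are hypothetical, and returning 0 for a zero divisor is an assumption of this sketch rather than documented behavior.

```python
def elementwise_sum(v0, v1):
    """Add two vectors position by position."""
    return [a + b for a, b in zip(v0, v1)]


def elementwise_max(v0, v1):
    """Take the larger value at each position."""
    return [max(a, b) for a, b in zip(v0, v1)]


def elementwise_div(v0, v1):
    """Divide position by position (assumption: 0 when the divisor is 0)."""
    return [a / b if b else 0 for a, b in zip(v0, v1)]


print(elementwise_sum([1, 2], [3, 4]))  # [4, 6]
print(elementwise_max([1, 5], [3, 4]))  # [3, 5]
print(elementwise_div([1, 9], [2, 0]))  # [0.5, 0]
```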

Submodules

pyss3.server module

SS3 classification server with visual explanations for live tests.

class pyss3.server.Server

Bases: object

SS3 HTTP server wrapper.

static get_port()

Return the server port.

Returns

the server port

Return type

int

static serve(clf=None, x_test=None, y_test=None, port=0, browser=True, quiet=True)

Wait for classification requests and serve them.

Parameters
  • clf (pyss3.SS3) – the SS3 model to be attached to this server.

  • x_test (list (of str)) – the list of documents to classify and visualize

  • y_test (list (of str)) – the list of category labels

  • port (int) – the port to listen on (default: random free port)

  • browser (bool) – if True, it automatically opens up the live test on your browser

  • quiet (bool) – if True, use quiet mode. Otherwise use verbose mode (default: True)

static set_model(clf)

Attach a given SS3 model to this server.

Parameters

clf (pyss3.SS3) – an SS3 model

static set_testset(x_test, y_test)

Assign the test set to visualize.

Parameters
  • x_test (list (of str)) – the list of documents to classify and visualize

  • y_test (list (of str)) – the list of category labels

static set_testset_from_files(test_path, folder_label=True)

Load the test set files to visualize from test_path.

Parameters
  • test_path (str) – the test set path

  • folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)

Returns

True if category documents were found, False otherwise

Return type

bool

static start_listening(port=0)

Start listening on a port and return its number.

(If a port number is not given, it uses a random free port).

Parameters

port (int) – the port to listen on
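
Asking the operating system for a random free port is commonly done by binding to port 0. The sketch below shows that general technique with the standard socket module; it is not the server's actual code, and the function name `listen_on` is hypothetical.

```python
import socket


def listen_on(port=0):
    """Bind a TCP socket; port 0 lets the OS choose a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", port))
    server.listen(1)
    # getsockname() reports the port that was actually assigned.
    return server, server.getsockname()[1]


server, port = listen_on()
print(port)
server.close()
```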

pyss3.server.content_type(ext)

Given a file extension, return the content type.

pyss3.server.get_http_body(http_request)

Given a HTTP request, return the body.

pyss3.server.get_http_contlength(http_request)

Given a HTTP request, return the Content-Length value.

pyss3.server.get_http_path(http_request)

Given a HTTP request, return the resource path.
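
The kind of parsing these three helpers perform can be sketched on a raw request string. The functions below are simplified stand-ins with hypothetical names, and the sample request (including the /classify path) is made up for the example.

```python
import re


def request_path(http_request):
    """Return the resource path from the request line."""
    request_line = http_request.split("\r\n", 1)[0]
    return request_line.split(" ")[1]


def request_content_length(http_request):
    """Return the Content-Length value, or 0 if the header is absent."""
    match = re.search(
        r"^content-length:\s*(\d+)", http_request, re.IGNORECASE | re.MULTILINE
    )
    return int(match.group(1)) if match else 0


def request_body(http_request):
    """Return everything after the blank line that ends the headers."""
    _, _, body = http_request.partition("\r\n\r\n")
    return body


request = (
    "POST /classify HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "hello"
)
print(request_path(request))            # /classify
print(request_content_length(request))  # 5
print(request_body(request))            # hello
```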

pyss3.server.parse_and_sanitize(rsc_path)

Very simple function to parse and sanitize the given path.

pyss3.cmd_line module

This module lets you interact with your SS3 models through a Command Line.

exception pyss3.cmd_line.ArgsParseError

Bases: Exception

Exception thrown when an error occurs while parsing command arguments.

exception pyss3.cmd_line.GetTestDataError

Bases: Exception

Exception thrown when an error occurs while retrieving the test data.

class pyss3.cmd_line.SS3Prompt(completekey='tab', stdin=None, stdout=None)

Bases: cmd.Cmd

Prompt main class.

args_classify(args)

Parse classify arguments.

args_evaluations(args)

Parse evaluations arguments.

args_grid_search(args)

Parse grid_search arguments.

args_k_fold(args)

Parse k_fold arguments.

args_learn(args)

Parse learn arguments.

args_live_test(args)

Parse live_test arguments.

args_save(args)

Parse save arguments.

args_set(args)

Parse set arguments.

args_test(args)

Parse test arguments.

args_train(args)

Parse train arguments.

complete_evaluations(text, line, begidx, endidx)

Complete arguments for ‘evaluations’ command.

complete_get(text, line, begidx, endidx)

Complete arguments for ‘get’ command.

complete_grid_search(text, line, begidx, endidx)

Complete arguments for ‘grid_search’ command.

complete_info(text, line, begidx, endidx)

Complete arguments for ‘info’ command.

complete_k_fold(text, line, begidx, endidx)

Complete arguments for ‘k_fold’ command.

complete_ld(text, line, begidx, endidx)

Complete arguments for ‘load’ command.

complete_learn(text, line, begidx, endidx)

Complete arguments for ‘learn’ command.

complete_live_test(text, line, begidx, endidx)

Complete arguments for ‘live_test’ command.

complete_load(text, line, begidx, endidx)

Complete arguments for ‘load’ command.

complete_plot(text, line, begidx, endidx)

Complete arguments for ‘plot’ command.

complete_save(text, line, begidx, endidx)

Complete arguments for ‘save’ command.

complete_set(text, line, begidx, endidx)

Complete arguments for ‘set’ command.

complete_sv(text, line, begidx, endidx)

Complete arguments for ‘save’ command.

complete_test(text, line, begidx, endidx)

Complete arguments for ‘test’ command.

complete_train(text, line, begidx, endidx)

Complete arguments for ‘train’ command.

default(line)

Default error message.

do_EOF(args='')

Quit the program.

do_classify(**kwargs)

Classify a document.

usage:

classify [DOCUMENT_PATH]

optional arguments:

DOCUMENT_PATH the path to the document file

do_clone(**kwargs)

Create a copy of the current model with a given name.

usage:

clone NEW_MODEL_NAME

required arguments:

NEW_MODEL_NAME the new model’s name

do_debug_term(**kwargs)

Show debugging information about a given n-gram.

Namely, print the n-gram frequency (fr), local value (lv), global value (gv), confidence value (cv), sanction (sn) weight and significance (sg) weight.

usage:

debug_term N_GRAM

required arguments:

N_GRAM the n-gram (word, bigram, trigram, etc.) to debug

examples:

debug_term the
debug_term potato
debug_term “machine learning”
debug_term “self driving car”

do_evaluations(**kwargs)

Perform different actions linked to evaluations results.

usage:

evaluations OPTION [PATH] [METHOD] [DEF_CAT] [P VAL [P VAL …]]

required arguments:
OPTION indicates the action to perform

values: {info,plot,save,remove} (default: info)

info - show information about evaluations (including best values).

plot - show an interactive 3-D plot with evaluation results in the web browser (it also saves it to disk).

save - save the interactive 3-D plot to disk.

remove - delete evaluation results from history.

optional arguments:

PATH the dataset path used in the evaluation of interest

METHOD the method that was used in the evaluation of interest

values: {test,K-fold} where K is an integer > 1

DEF_CAT default category used in the evaluation of interest

values: {most-probable,unknown} or a category label

P VAL the hyperparameter value (only for option “remove”)

P values: {s,l,p,a}

VAL values: float

examples:
  • show information about all evaluations:

    evaluations info

  • show information about evaluations in path “a/dataset/path”:

    evaluations info a/dataset/path

  • information about 3-fold evaluations in path “a/dataset/path”:

    evaluations info a/dataset/path 3-fold

  • information about test evaluations in path “a/dataset/path”:

    evaluations info a/dataset/path test

  • plot evaluations:

    evaluations plot

  • save evaluations:

    evaluations save

  • remove all evaluation result(s) in path “a/dataset/path”:

    evaluations remove a/dataset/path

  • remove 4-fold evaluation result(s) in path “a/dataset/path” with l = 1.1 and s = .45:

    evaluations remove a/dataset/path 4-fold l 1.1 s .45

do_exit(args='')

Quit the program.

do_get(**kwargs)

Get a given hyperparameter value.

usage:

get PARAM

required arguments:
PARAM the hyperparameter name

values: {s,l,p,a}

examples:

get s
get l
get p
get a

do_grid_search(**kwargs)

Given a dataset, perform a grid search using the given hyperparameter values.

usage:

grid_search PATH [LABEL] [DEF_CAT] [METHOD] P EXP [P EXP …] [no-cache]

required arguments:

PATH the dataset path

P EXP a list of values for a given hyperparameter.

where:

P is a hyperparameter name. values: {s,l,p,a}

EXP is a python expression returning a float or a list of floats. Note: if this expression contains whitespaces, use quotation marks (e.g. “[0.5, 1.5]”)

examples:

s [.3,.4,.5]
s “[.3, .4, .5]” (Note the whitespaces and the “”)
p r(.2,.8,6) (i.e. 6 points between .2 and .8)

optional arguments:
LABEL where to read category labels from.

values: {file,folder} (default: folder)

DEF_CAT default category to be assigned when the model is not able to actually classify a document.

values: {most-probable,unknown} or a category label (default: most-probable)

METHOD the method to be used

values: {test, K-fold} (default: test)

where K-fold indicates the number of folds to be used; K is an integer > 1 (e.g. 4-fold, 10-fold, etc.)

no-cache if present, disable the cache and recompute all the values

examples:

grid_search a/testset/path s r(.2,.8,6) l r(.1,2,6) -p r(.5,2,6) a [0,.01]
grid_search a/dataset/path 4-fold -s [.2,.3,.4,.5] -l [.5,1,1.5] -p r(.5,2,6)
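
The idea behind a grid search over s, l, p and a can be sketched in a few lines. The helper `evenly_spaced` is a hypothetical stand-in for the r(.2,.8,6) expression, and it assumes both end points are included; the sketch only enumerates the combinations and does not evaluate any model.

```python
from itertools import product


def evenly_spaced(start, stop, n):
    """Return n evenly spaced values from start to stop (n >= 2, ends included)."""
    step = (stop - start) / (n - 1)
    return [round(start + i * step, 4) for i in range(n)]


ss = evenly_spaced(.2, .8, 6)   # candidate values for s
ll = [.5, 1, 1.5]               # candidate values for l
pp = [1]                        # candidate values for p
aa = [0, .01]                   # candidate values for a

# Every (s, l, p, a) combination that would be evaluated.
grid = list(product(ss, ll, pp, aa))
print(ss)         # [0.2, 0.32, 0.44, 0.56, 0.68, 0.8]
print(len(grid))  # 36
```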

do_info(**kwargs)

Show useful information.

usage:

info OPTION

required arguments:
OPTION indicates what information to show

values: {all, parameters, categories, evaluations} (default: all)

examples:

info
info evaluations

do_k_fold(**kwargs)

Perform a stratified k-fold validation using the given dataset.

usage:

k_fold PATH [LABEL] [DEF_CAT] [N-grams] [K-fold] [P VAL …] [no-cache]

required arguments:

PATH the dataset path

optional arguments:
LABEL where to read category labels from.

values: {file,folder} (default: folder)

DEF_CAT default category to be assigned when the model is not able to actually classify a document.

values: {most-probable,unknown} or a category label (default: most-probable)

N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams” only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).

value: {N-grams} with N integer > 0 (default: 1-grams)

K-fold indicates the number of folds to be used.

value: {K-fold} with K integer > 1 (default: 4-fold)

P VAL sets a hyperparameter value (e.g. s 0.45)

P values: {s,l,p,a}

VAL values: float

no-cache if present, disable the cache and recompute values

examples:

k_fold a/dataset/path 10-fold
k_fold a/dataset/path 4-fold -s .45 -l 1.1 -p 1
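
"Stratified" means that each fold keeps roughly the same proportion of every category as the whole dataset. The sketch below shows one simple way to build such folds by dealing the documents of each category round-robin; it illustrates the concept only and is not the procedure used by this command (the name `stratified_folds` is hypothetical).

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split document indexes into k folds, balancing every label."""
    indexes_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indexes_by_label[label].append(index)

    folds = [[] for _ in range(k)]
    for indexes in indexes_by_label.values():
        # Deal the documents of this label round-robin across the folds.
        for position, index in enumerate(indexes):
            folds[position % k].append(index)
    return folds


labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(stratified_folds(labels, 2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Each fold would then serve once as the test set while the remaining folds are used for training.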

do_learn(**kwargs)

Learn a new document.

usage:

learn CAT [N-grams] [DOCUMENT_PATH]

required arguments:

CAT the category label

optional arguments:
N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams” only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).

value: {N-grams} with N integer > 0 (default: 1-grams)

DOCUMENT_PATH the path to the document file

do_license(args)

Print the license.

do_live_test(**kwargs)

Interactively and graphically test the model.

usage:

live_test [TEST_PATH [LABEL]] [verbose]

optional arguments:

TEST_PATH the test set path

LABEL where to read category labels from.

values: {file,folder} (default: folder)

verbose if present, run in verbose mode

examples:

live_test
live_test a/testset/path
live_test a/testset/path verbose

do_load(**kwargs)

Load a local model (given its name).

usage:

load MODEL_NAME

required arguments:

MODEL_NAME the model’s name

do_new(**kwargs)

Create a new empty SS3 model with a given name.

usage:

new MODEL_NAME

required arguments:

MODEL_NAME the model’s name

do_next_word(**kwargs)

Show up to 3 possible words to follow after the given sentence.

usage:

next_word SENT

required arguments:

SENT a sentence

examples:

next_word “the self driving”
next_word “a machine learning”

do_plot(**kwargs)

Plot word value distribution curve or the evaluation results.

usage:

plot OPTION

required arguments:
OPTION indicates what to plot

values:

evaluations;

distribution CAT;

where:

CAT the category label

examples:

plot distribution a_category
plot evaluations

do_rename(**kwargs)

Rename the current model with a given name.

usage:

rename NEW_MODEL_NAME

required arguments:

NEW_MODEL_NAME the model’s new name

do_save(**kwargs)

Save to disk the model, learned vocabulary, evaluations results, etc.

usage:

save OPTION

required arguments:
OPTION indicates what to save to disk

values:

model; (default)

evaluations;

vocabulary [CAT];

stopwords [SG_THRESHOLD];

where:

CAT the category label

SG_THRESHOLD significance (sg) value used as a threshold to consider words as stopwords (i.e. words with sg < sg_threshold for all categories will be considered as “stopwords”) (default: .01)

examples:

save
save model
save vocabulary
save vocabulary a_category
save stopwords
save stopwords .1

do_set(**kwargs)

Set a given hyperparameter value.

usage:

set P VAL [P VAL …]

required arguments:
P VAL sets a hyperparameter value

examples: s .45; s .5;

P values: {s,l,p,a}

VAL values: float

examples:

set s .5
set l 0.5
set p 2
set s .5 l 0.5 p 2
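
The "P VAL [P VAL …]" pattern pairs each hyperparameter name with a float. A minimal sketch of parsing such an argument list is shown below; the function name `parse_hparam_pairs` and its error handling are assumptions and do not reproduce the command's own parser.

```python
def parse_hparam_pairs(tokens):
    """Turn ['s', '.5', 'l', '0.5'] into {'s': 0.5, 'l': 0.5}."""
    valid_names = {"s", "l", "p", "a"}
    if len(tokens) % 2 != 0:
        raise ValueError("every hyperparameter name needs a value")

    hparams = {}
    for name, value in zip(tokens[::2], tokens[1::2]):
        if name not in valid_names:
            raise ValueError("unknown hyperparameter: " + name)
        hparams[name] = float(value)
    return hparams


print(parse_hparam_pairs("s .5 l 0.5 p 2".split()))
# {'s': 0.5, 'l': 0.5, 'p': 2.0}
```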

do_test(**kwargs)

Test the model using the given test set.

usage:

test TEST_PATH [LABEL] [DEF_CAT] [P VAL …] [no-cache]

required arguments:

TEST_PATH the test set path

optional arguments:
LABEL where to read category labels from.

values: {file,folder} (default: folder)

DEF_CAT default category to be assigned when the model is not able to actually classify a document.

values: {most-probable,unknown} or a category label (default: most-probable)

P VAL sets a hyperparameter value

examples: s .45; s .5;

P values: {s,l,p,a}

VAL values: float

no-cache if present, disable the cache and recompute values

examples:

test a/testset/path
test a/testset/path -s .45 -l 1.1 -p 1
test a/testset/path unknown -s .45 -l 1.1 -p 1 no-cache

do_train(**kwargs)

Train the model using a training set and then save it.

usage:

train TRAIN_PATH [LABEL] [N-grams]

required arguments:

TRAIN_PATH the training set path

optional arguments:
LABEL where to read category labels from.

values: {file,folder} (default: folder)

N-grams indicates the maximum n-grams to be learned (e.g. a value of “1-grams” means only words will be learned; “2-grams” only 1-grams and 2-grams; “3-grams”, only 1-grams, 2-grams and 3-grams; and so on).

value: {N-grams} with N integer > 0 (default: 1-grams)

examples:

train a/training/set/path 3-grams

do_update(**kwargs)

Update model values (cv, gv, lv, etc.).

precmd(line)

Hook method executed just before the command.

preloop()

Hook method executed once when cmdloop() is called.

pyss3.cmd_line.delete_results(data_path, method, def_cat, hparams, only_count=False)

Remove evaluations from history.

pyss3.cmd_line.delete_results_slpa(rh_metric, hparams, only_count=False, best=True)

Remove evaluations from history given hyperparameters s, l, p, a.

pyss3.cmd_line.evaluations_info(data_path=None, method=None)

Print evaluations best values.

pyss3.cmd_line.evaluations_remove(data_path, method, def_cat, hparams)

Evaluation remove command handler.

pyss3.cmd_line.get_global_best(values)

Given a list of evaluations values, return the best one.

pyss3.cmd_line.get_results_history(path, method, def_cat)

Given a path, a method and a default category return results history.

pyss3.cmd_line.get_test_data_cache(path, def_cat, method, s, l, p, a)

Return test results from cache.

Perform a grid search using values from ss, ll, pp, aa.

pyss3.cmd_line.grid_search_loop(data_path, x_test, y_test, categories, def_cat, k_fold, i_fold, ss, ll, pp, aa, cache=True, leave_pbar=True)

Grid search main loop.

pyss3.cmd_line.intersect(l0, l1)

Given two lists return the intersection.

pyss3.cmd_line.is_in_cache(path, method, def_cat, s, l, p, a)

Return whether this evaluation is already computed.

pyss3.cmd_line.json2rh(dct)

Convert a given dictionary to a RecursiveDefaultDict.

pyss3.cmd_line.k_fold2method(k_fold)

Convert the k number to a proper method string.

pyss3.cmd_line.k_fold_classification_report(data_path, method, def_cat, s, l, p, a)

Create the classification report for k-fold validations.

pyss3.cmd_line.k_fold_validation(data_path, folder_label, def_cat, n_grams, k_fold, s, l, p, a, cache=True)

Perform a stratified k-fold cross validation using the given data.

pyss3.cmd_line.load_data(data_path, folder_label, def_cat=None, return_cat_index=True, cmd_name='test')

Load documents from disk, return the x_data, y_data and categories.

pyss3.cmd_line.load_results_history()

Load results history (evaluations) from disk.

pyss3.cmd_line.main()

Main function.

pyss3.cmd_line.module_path(file_path)

Convert a file path relative to this module path.

pyss3.cmd_line.parse_hparams_args(op_args, defaults=True)

Parse hyperparameters arguments list.

pyss3.cmd_line.plot_confusion_matrices(cms, classes, info='', max_colums=3)

Show and plot the confusion matrices.

pyss3.cmd_line.re_in(regex, l)

Given a list of strings, return the first match in the list.

pyss3.cmd_line.requires_args(func)

A @decorator.

pyss3.cmd_line.requires_model(func)

A @decorator.

pyss3.cmd_line.results(y_true, y_pred, categories, def_cat, cache=True, method='test', data_path='', folder=False, plots=True, k_fold=1, i_fold=0)

Compute evaluation results and save them to disk.

pyss3.cmd_line.round_fix(v)

Round the number v (used to keep the results history file small).

pyss3.cmd_line.save_html_evaluations(show_plot=True)

Save results history (evaluations) to disk (interactive html file).

pyss3.cmd_line.save_results(rh, categories, accuracy, report, conf_matrix, k_fold, i_fold, s, l, p, a)

Save evaluation results to disk.

pyss3.cmd_line.save_results_history()

Save results history (evaluations) to disk.

pyss3.cmd_line.split_args(args)

Parse and split arguments.

pyss3.cmd_line.subtract(l0, l1)

Subtract list l1 from l0.

pyss3.cmd_line.test(test_path, folder_label, def_cat, s, l, p, a, cache)

Test the model with a given test set.

pyss3.cmd_line.train(x_train, y_train, n_grams, train_path='', folder_label=None, save=True, leave_pbar=True)

Train a new model with the given training set.

pyss3.util module

This is a helper module with utility classes and functions.

class pyss3.util.Dataset

Bases: object

A helper class with methods to read/write datasets.

static load_from_files(data_path, folder_label=True, as_single_doc=False)

Load category documents from disk.

Parameters
  • data_path (str) – the training or the test set path

  • folder_label (bool) – if True, read category labels from folders, otherwise, read category labels from file names. (default: True)

  • as_single_doc (bool) – read the documents as a single (and big) document (default: False)

Returns

the (x_train, y_train) or the (x_test, y_test) pairs.

Return type

tuple
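
Reading category labels from folders can be pictured with the sketch below. It assumes a layout of one sub-folder per category, each holding one file per document; that layout, the function name `load_from_folders` and the sample files are assumptions for illustration and may differ from what the library expects.

```python
import os
import tempfile


def load_from_folders(data_path):
    """Return (documents, labels), using each sub-folder name as the label."""
    x_data, y_data = [], []
    for category in sorted(os.listdir(data_path)):
        category_dir = os.path.join(data_path, category)
        if not os.path.isdir(category_dir):
            continue
        for file_name in sorted(os.listdir(category_dir)):
            file_path = os.path.join(category_dir, file_name)
            with open(file_path, encoding="utf-8") as doc_file:
                x_data.append(doc_file.read())
            y_data.append(category)
    return x_data, y_data


# Build a tiny throwaway dataset to demonstrate the function.
with tempfile.TemporaryDirectory() as root:
    for category, text in [("sports", "what a goal"), ("tech", "new chip released")]:
        os.makedirs(os.path.join(root, category))
        doc_path = os.path.join(root, category, "doc1.txt")
        with open(doc_path, "w", encoding="utf-8") as doc_file:
            doc_file.write(text)
    x_data, y_data = load_from_folders(root)

print(x_data)  # ['what a goal', 'new chip released']
print(y_data)  # ['sports', 'tech']
```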

class pyss3.util.Preproc

Bases: object

A helper class with methods to preprocess input documents.

static clean_and_ready(text, dots=True, normalize=True, min_len=1)

Clean and prepare the text.

class pyss3.util.Print

Bases: object

Helper class to handle print functionalities.

static error(msg, raises=None, offset=0, decorator=True)

Print an error.

Parameters
  • msg (str) – the message to show

  • raises (Exception) – the exception to be raised after showing the message

  • offset (int) – shift the message to the right (offset characters)

  • decorator (bool) – if True, use error message decorator

static info(msg, newln=True, offset=0, decorator=True)

Print an info message.

Parameters
  • msg (str) – the message to show

  • newln (bool) – use new line after the message (default: True)

  • offset (int) – shift the message to the right (offset characters)

  • decorator (bool) – if True, use info message decorator

static quiet_begin()

Begin a “be quiet” block.

static quiet_end()

End the “be quiet” block.

static set_decorator_error(start, end=None)

Set error messages decorator.

Parameters
  • start (str) – messages prefix

  • end (str) – messages suffix

static set_decorator_info(start, end=None)

Set info messages decorator.

Parameters
  • start (str) – messages prefix

  • end (str) – messages suffix

static set_decorator_warn(start, end=None)

Set warning messages decorator.

Parameters
  • start (str) – messages prefix

  • end (str) – messages suffix

static set_quiet(value)

Set quiet mode value.

When quiet mode is enabled, only error messages will be displayed.

Parameters

value (bool) – if True, enables quiet mode

static show(msg='', newln=True, offset=0)

Print a message.

Parameters
  • msg (str) – the message to show

  • newln (bool) – use new line after the message (default: True)

  • offset (int) – shift the message to the right (offset characters)

style

alias of Style

static warn(msg, newln=True, raises=None, offset=0, decorator=True)

Print a warning message.

Parameters
  • msg (str) – the message to show

  • newln (bool) – use new line after the message (default: True)

  • raises (Exception) – the exception to be raised after showing the message

  • offset (int) – shift the message to the right (offset characters)

  • decorator (bool) – if True, use warning message decorator

class pyss3.util.RecursiveDefaultDict

Bases: dict

A dict whose default value is a dict.
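
The idea of "a dict whose default value is a dict" can be sketched with the standard `__missing__` hook. The class below is an illustrative re-creation under a different, hypothetical name (`NestedDict`), not the library's implementation, and the keys used in the demo are made up.

```python
class NestedDict(dict):
    """A dict that creates a nested NestedDict for any missing key."""

    def __missing__(self, key):
        value = self[key] = NestedDict()
        return value


history = NestedDict()
# Intermediate levels are created on first access.
history["a/dataset/path"]["4-fold"]["accuracy"] = 0.9
print(history)  # {'a/dataset/path': {'4-fold': {'accuracy': 0.9}}}
```

This makes it possible to write deeply nested results without checking whether each intermediate level already exists.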

class pyss3.util.Style

Bases: object

Helper class to handle print styles.

static blue(text)

Apply ‘blue’ style to text.

static bold(text)

Apply bold style to text.

static fail(text)

Apply the ‘fail’ style to text.

static green(text)

Apply ‘green’ style to text.

static header(text)

Apply ‘header’ style to text.

static ubold(text)

Apply underline and bold style to text.

static underline(text)

Underline text.

static warning(text)

Apply the ‘warning’ style to text.