Logsparser documentation

class logflow.logsparser.Cardinality.Cardinality(counter_general: dict, cardinality: int)[source]

A cardinality is the length of a log line, defined as its number of words.

Parameters:
  • counter_general (dict) – Counter of the distinct log lines across the whole dataset.
  • cardinality (int) – Number of words
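
As an illustration of the definition above, the following minimal sketch groups tokenized log messages by their number of words. The helper name group_by_cardinality and the sample messages are hypothetical and are not part of logflow.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_cardinality(messages: List[List[str]]) -> Dict[int, List[List[str]]]:
    """Group tokenized messages by their number of words (hypothetical helper)."""
    groups: Dict[int, List[List[str]]] = defaultdict(list)
    for words in messages:
        # The cardinality of a message is simply its word count.
        groups[len(words)].append(words)
    return dict(groups)


messages = [
    "Connection of user Marc".split(),
    "Connection of user Anna".split(),
    "Application failure".split(),
]
groups = group_by_cardinality(messages)
```

Each group can then be handled independently, which is presumably what allows one Cardinality object per line length.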
compute() → Dict[int, List[logflow.logsparser.Pattern.Pattern]][source]

Start the workflow for the multithreading implementation.

Returns:the dict of patterns detected.
Return type:(dict)
counter_word()[source]

Count the occurrences of each word according to its position in the log.
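
A possible way to count words by position is sketched below. It is an assumption about the idea, not the library's implementation; the function name count_words_by_position and the (position, word) key layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_words_by_position(messages: List[List[str]]) -> Dict[Tuple[int, str], int]:
    """Count how often each word appears at each position (hypothetical helper)."""
    counts: Counter = Counter()
    for words in messages:
        for position, word in enumerate(words):
            # Key on both the position and the word itself.
            counts[(position, word)] += 1
    return dict(counts)


position_counts = count_words_by_position([
    "Connection of user Marc".split(),
    "Connection of user Anna".split(),
])
```

Words that recur at the same position (here "Connection", "of", "user") are natural candidates for the fixed part of a pattern, while rare ones ("Marc", "Anna") look like variables.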

detect_patterns()[source]

Detect the pattern based on the maximum number of similar words.

order_pattern()[source]

Order the patterns by size to speed up the association between lines and patterns.

class logflow.logsparser.Dataset.Dataset(list_files: list, dict_patterns={}, path_data='', saving=False, name_dataset='', path_model='', concat=True, parser_function='', sort_function='', nb_files_per_chunck=50, output='', nb_cpu=-1, multithreading=True)[source]

A dataset is an object containing the data. It uses the Journal class to read and parse the logs, and it handles saving the logs, patterns, and parsed files.

Parameters:
  • list_files (list) – List of the logs file to read. Each element of the list is a path to a file.
  • dict_patterns (dict, optional) – Patterns previously detected by the first step. If left at the default, the dict is created and the dataset computes the patterns. If provided, the dataset uses the patterns to associate each line of the file with a pattern. Defaults to {}.
  • path_data (str, optional) – Path to the data. Defaults to “”.
  • saving (bool, optional) – Saving the patterns to generate the embeddings. Defaults to False.
  • name_dataset (str, optional) – Name of the patterns to save. Defaults to “”.
  • path_model (str, optional) – Path of the folder to save the patterns. Defaults to “”.
  • concat (bool, optional) – Process a chunk of files per thread instead of one file per thread. This improves performance because of Python's poor multiprocessing performance. Defaults to True.
  • parser_function (function, optional) – Function to split the log entry and get the message part. Defaults to “”, which means the line is split on spaces and the words after the 9th position are used.
  • sort_function (function, optional) – Function to sort the logs. Defaults to “”, which means the logs are not sorted.
  • nb_files_per_chunck (int, optional) – Number of files per chunk. Defaults to 50.
  • nb_cpu (int, optional) – Number of threads to be used. Defaults to -1, which uses all the available CPUs.
static execute(journal: logflow.logsparser.Journal.Journal) → logflow.logsparser.Journal.Journal[source]

Execute the run() function of the Journal class. It is used by the multithreading implementation.

Parameters:journal (Journal) – A journal to process
Returns:A processed journal
Return type:Journal
static parser_message(line: str) → List[str][source]

Split the line of log and return the message part

Parameters:line (str) – the line of log
Returns:the message part of the line represented as a list of words.
Return type:list
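
The default behaviour described for parser_function (split on spaces and keep the words after the 9th position) could look roughly like the sketch below. It assumes that "after the 9th position" means dropping the first nine whitespace-separated fields. The name default_parser_message, the nb_header_fields parameter, and the sample line are hypothetical.

```python
from typing import List


def default_parser_message(line: str, nb_header_fields: int = 9) -> List[str]:
    """Return the message part of a log line as a list of words.

    Assumption: the first `nb_header_fields` whitespace-separated fields
    (timestamp, host, and so on) are header fields and are dropped.
    """
    return line.strip().split()[nb_header_fields:]


line = "h1 h2 h3 h4 h5 h6 h7 h8 h9 Connection of user Marc\n"
message = default_parser_message(line)
```

A function with this shape (str in, list of words out) is what a custom parser_function would need to provide for log formats with a different header layout.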
read_files_associating(multithreading=True, concat=True)[source]

Read the files and associate one pattern to each line of the files.

As a first step, the function merges the list of files into a list of chunks. Each chunk contains nb_files_per_chunck files. This improves performance, because pickling and unpickling data between processes is slow in Python.

This function executes the run() method of the Journal class for each chunk of files.

Note that we only provide a multithreading implementation for the moment.

Parameters:
  • multithreading (bool, optional) – Use the multithreading implementation. Defaults to True.
  • concat (bool, optional) – Use a chunk of files per thread instead of one file per thread. Defaults to True.
  • nb_files_per_chunck (int, optional) – Number of files per chunk. Defaults to 50.
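
The chunking step described above can be sketched as follows. This is a minimal illustration, not the library's code; the helper name make_chunks and the file names are hypothetical.

```python
from typing import List


def make_chunks(list_files: List[str], nb_files_per_chunck: int = 50) -> List[List[str]]:
    """Split a list of file paths into consecutive chunks (hypothetical helper)."""
    return [
        list_files[start:start + nb_files_per_chunck]
        for start in range(0, len(list_files), nb_files_per_chunck)
    ]


files = [f"logs/file_{i}.log" for i in range(120)]
chunks = make_chunks(files)
```

With 120 files and the default chunk size, only three work items cross the process boundary instead of 120, which is where the saving on pickling comes from.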
read_files_parsing(concat=True)[source]

Read the files and compute the patterns.

As a first step, the function merges the list of files into a list of chunks. Each chunk contains nb_files_per_chunck files. This improves performance, because pickling and unpickling data between processes is slow in Python.

This function executes the run() method of the Journal class for each chunk of files.

Note that we only provide a multithreading implementation for the moment.

Parameters:
  • multithreading (bool, optional) – Use the multithreading implementation. Defaults to True.
  • concat (bool, optional) – Use a chunk of files per thread instead of one file per thread. Defaults to True.
stats()[source]

Show classes distribution across the dataset

class logflow.logsparser.Embedding.Embedding(list_classes=[], loading=False, name_dataset='', path_data='', path_model='', dir_tmp='')[source]

Compute the embedding of each pattern based on the word2vec method. Here, each line is represented by the ID (integer) of its pattern.

Note that word2vec is based on Google's C++ implementation. It therefore requires a file as input, so list_classes cannot be used directly for the learning step. For best performance, list_classes is written to a temporary file, which is removed afterwards.

Parameters:
  • list_classes (list, optional) – List of patterns. Defaults to [].
  • loading (bool, optional) – Load the list of patterns from a file. Note that you must provide list_classes if loading is False. Defaults to False.
  • name_dataset (str, optional) – Name of the dataset. Used for loading it. Defaults to “”.
  • path_data (str, optional) – Path to the dataset. Defaults to “”.
  • path_model (str, optional) – Path to the model. Defaults to “”.
  • dir_tmp (str, optional) – Path used for the temporary file. This path can be on an SSD or in RAM for better performance. Defaults to “/tmp/”.
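
The temporary-file approach mentioned above can be illustrated with the standard tempfile module. This is only a sketch under assumptions: the helper name write_temporary_file and the space-separated file layout are hypothetical, and the actual format expected by the word2vec backend may differ.

```python
import os
import tempfile
from typing import List


def write_temporary_file(list_classes: List[int], dir_tmp: str) -> str:
    """Write pattern IDs to a temporary text file and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", dir=dir_tmp, suffix=".txt", delete=False
    ) as handle:
        # Assumed layout: pattern IDs separated by single spaces.
        handle.write(" ".join(str(pattern_id) for pattern_id in list_classes))
        return handle.name


dir_tmp = tempfile.gettempdir()
path = write_temporary_file([3, 7, 7, 12], dir_tmp)
with open(path) as handle:
    content = handle.read()
# Remove the file once the learning step no longer needs it.
os.remove(path)
```

Pointing dir_tmp at a RAM-backed location keeps this round trip through the file system cheap.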
static clear_list(args: Tuple[List[int], List[int]]) → list[source]

Keep only the words from the list of vocab in the list of patterns.

Parameters:args ((list, list)) – The first argument is the list of patterns. The second is the list of vocab.
Returns:list of patterns with only the words into the list of vocab.
Return type:list
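
The filtering performed by clear_list can be sketched as below, assuming that both lists hold integer pattern IDs as the type hints suggest. The name clear_list_sketch is hypothetical, and the use of a set for membership tests is a design choice of this example.

```python
from typing import List, Tuple


def clear_list_sketch(args: Tuple[List[int], List[int]]) -> List[int]:
    """Keep only the pattern IDs that belong to the vocabulary."""
    list_patterns, list_vocab = args
    vocab = set(list_vocab)  # constant-time membership tests
    return [pattern_id for pattern_id in list_patterns if pattern_id in vocab]


filtered = clear_list_sketch(([1, 2, 3, 2, 9], [2, 3]))
```

Taking a single tuple argument, as in the documented signature, makes the function easy to pass to a pool's map-style methods.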
create_temporary_file()[source]

Create the temporary files for the learning step

generate_list_embeddings()[source]

Filter the list of patterns according to the learned embeddings. The word2vec model needs a minimum number of examples per word to learn an embedding. Words excluded from the word2vec learning are removed.

static list_to_str(list_str: List[str]) → str[source]

Merge a list of integers into a string.

Parameters:list_str (list) – list of integers
Returns:string representation
Return type:str
load()[source]

Loads the files

start()[source]

Starts the process

train()[source]

Trains the word2vec model based on the list of patterns.

class logflow.logsparser.Journal.Journal(parser_message, path: str, associated_pattern=False, dict_patterns={}, large_file=False, pointer=-1, encoding='latin-1', sort_function='', output='')[source]

A journal is a list of log files. It reads and parses the logs, and associates them with patterns.

Parameters:
  • parser_message (function) – Function to split the message part of the line.
  • path (str) – path to the data
  • associated_pattern (bool, optional) – Associate or discover the patterns. Note that if associated_pattern is True, dict_patterns must be provided. Defaults to False.
  • dict_patterns (dict, optional) – Dict of the patterns for the association. Defaults to {}.
  • large_file (bool, optional) – Optimization for the reading of one large file. Not implemented yet. Defaults to False.
  • pointer (int, optional) – Optimization for the reading of one large file. Not implemented yet. Defaults to -1.
  • encoding (str, optional) – Encoding of the files read. Defaults to “latin-1”.
  • sort_function (function, optional) – Function to sort the logs. Defaults to “”, means logs are not sorted.
  • output (str, optional) – Set the output type. Use “logpai” to be compatible with the benchmark provided by logpai. By default, only the ID of the log is returned.
associate_pattern(line: str)[source]

Associate a line with a pattern. Add this pattern to the list of patterns.

Parameters:line (str) – line to be associated.
count_log(line: str)[source]

Count the number of identical entries according to their descriptors, to save space and computation.

Example using 3 entries:
“Connexion of user Marc”
“Connexion of user Marc”
“Application failure node [1,0,0,2,4]”

Counter_logs will be: {“Connexion of user Marc”: 2, “Application failure node [1,0,0,2,4]”: 1}.

To avoid useless computation, we use a dictionary mapping each line to its descriptors, so the descriptors are not recomputed each time a line is seen.

Parameters:line (str) – line of log to add to the counter.
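
The counting and caching idea can be sketched with collections.Counter. The describe function below is a hypothetical placeholder (it only normalizes whitespace) standing in for the real descriptor computation, which is not detailed here. The names count_log_sketch, counter_logs, and descriptor_cache are also assumptions of this example.

```python
from collections import Counter
from typing import Dict

counter_logs: Counter = Counter()
descriptor_cache: Dict[str, str] = {}


def describe(line: str) -> str:
    """Hypothetical placeholder for the descriptor computation."""
    return " ".join(line.split())


def count_log_sketch(line: str) -> None:
    """Count a line, computing its descriptors only the first time it is seen."""
    if line not in descriptor_cache:
        descriptor_cache[line] = describe(line)
    counter_logs[descriptor_cache[line]] += 1


for entry in [
    "Connexion of user Marc",
    "Connexion of user Marc",
    "Application failure node [1,0,0,2,4]",
]:
    count_log_sketch(entry)
```

After the loop, counter_logs holds the same counts as the example above, and describe has been called only twice for three lines.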
static create_vector(word: str) → str[source]

Create the vector of descriptors associated to a word

Parameters:word (str) – the word to describe using descriptors
Returns:the descriptors
Return type:str
filter_word(word: str) → str[source]

Get the descriptors of the word

Parameters:word (str) – word to describe
Returns:descriptors of the word. They use a string representation of a list.
Return type:str
static find_pattern(message: List[str], dict_patterns: dict) → logflow.logsparser.Pattern.Pattern[source]

Find the pattern associated to a log.

The best pattern is the one that has the most words in common with the line.

Parameters:
  • message (List[str]) – list of the words of the message part of the log.
  • dict_patterns (dict) – the dict of patterns.
Returns:the pattern associated to the line.
Return type:Pattern
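
The selection rule can be sketched as follows. Several points are assumptions rather than documented behaviour: dict_patterns is taken to be keyed by cardinality (as the return type of Cardinality.compute() suggests), pattern indexes are treated as zero-based integers, and ties go to the first pattern encountered. PatternSketch and find_pattern_sketch are hypothetical stand-ins for the real classes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PatternSketch:
    """Hypothetical stand-in for logflow's Pattern class."""
    cardinality: int
    pattern_word: List[str]
    pattern_index: List[int]  # assumed zero-based positions


def find_pattern_sketch(
    message: List[str], dict_patterns: Dict[int, List[PatternSketch]]
) -> Optional[PatternSketch]:
    """Return the candidate pattern sharing the most words with the message."""
    best, best_score = None, -1
    # Only patterns with the same number of words as the message are candidates.
    for pattern in dict_patterns.get(len(message), []):
        score = sum(
            1
            for word, index in zip(pattern.pattern_word, pattern.pattern_index)
            if message[index] == word
        )
        if score > best_score:
            best, best_score = pattern, score
    return best


login = PatternSketch(4, ["Connection", "of", "user"], [0, 1, 2])
failure = PatternSketch(4, ["Application"], [0])
patterns = {4: [failure, login]}
found = find_pattern_sketch("Connection of user Marc".split(), patterns)
```

Restricting the search to one cardinality keeps the comparison cheap, since every candidate index is guaranteed to be valid for the message.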

is_number(s: str) → bool[source]

Detect if a string is a float.

Parameters:s (str) – string to parse
Returns:True if the string is a float, False otherwise.
Return type:bool
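
A common way to implement such a check is to attempt the conversion and catch the failure. The sketch below is an assumption about the approach, not the library's code, and the name is_number_sketch is hypothetical.

```python
def is_number_sketch(s: str) -> bool:
    """Return True if the string can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


checks = [is_number_sketch(s) for s in ["3.14", "42", "1e-3", "node12"]]
```

Note that float() also accepts strings such as "nan" and "inf", so a check written this way treats them as numbers.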
read_file()[source]

Read the log files.

run()[source]

Start the process

static static_filter_word(word: str) → str[source]

Get the descriptors of the word

Parameters:word (str) – word to describe
Returns:descriptors of the word. They use a string representation of a list.
Return type:str
static static_is_number(s: str) → bool[source]

Detect if a string is a float.

Parameters:s (str) – string to parse
Returns:True if the string is a float, False otherwise.
Return type:bool
class logflow.logsparser.Parser.Parser(dataset: logflow.logsparser.Dataset.Dataset)[source]

The parser takes a dataset and computes its patterns.

Parameters:dataset (Dataset) – dataset for computing the patterns.
detect_pattern() → dict[source]

Detect the patterns of the dataset and return the dict of patterns.

Returns:dict of patterns computed.
Return type:dict
class logflow.logsparser.Pattern.Pattern(cardinality: int, pattern_word: list, pattern_index: list)[source]

Represents a pattern. A pattern is described by its cardinality (number of words of the associated line), its words and indexes of these words.

Example:
pattern_word = [“house”, “cat”]
pattern_index = [“3”, “5”]
Here, we are looking for the word “house” at the 3rd position and the word “cat” at the 5th position.

Parameters:
  • cardinality (int) – Cardinality of the associated line.
  • pattern_word (list) – list of the pattern’s words
  • pattern_index (list) – list of the indexes of the pattern’s words