Logsparser documentation
-
class logflow.logsparser.Cardinality.Cardinality(counter_general: dict, cardinality: int)
A cardinality is the length of a line, defined as its number of words.
Parameters: - counter_general (dict) – Counter of the distinct logs across the whole dataset.
- cardinality (int) – Number of words
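To illustrate the definition above, the following sketch computes the cardinality of a line and groups a counter of lines by cardinality. The helper names `cardinality_of` and `group_by_cardinality` are hypothetical and are not part of logflow. Splitting on whitespace is also an assumption.

```python
from collections import defaultdict


def cardinality_of(line: str) -> int:
    """Return the number of whitespace-separated words in a line."""
    return len(line.split())


def group_by_cardinality(counter_general: dict) -> dict:
    """Group a {line: count} counter into {cardinality: {line: count}}."""
    groups = defaultdict(dict)
    for line, count in counter_general.items():
        groups[cardinality_of(line)][line] = count
    return dict(groups)


# Illustrative data only.
counter_general = {
    "Connexion of user Marc": 2,
    "Application failure node [1,0,0,2,4]": 1,
    "Disk full": 3,
}
groups = group_by_cardinality(counter_general)
```

Here the first two lines share a cardinality of 4 and end up in the same group, while "Disk full" has a cardinality of 2.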
-
class logflow.logsparser.Dataset.Dataset(list_files: list, dict_patterns={}, path_data='', saving=False, name_dataset='', path_model='', concat=True, parser_function='', sort_function='', nb_files_per_chunck=50, output='', nb_cpu=-1, multithreading=True)
A dataset is an object containing the data. It uses the Journal class to read and parse the logs, and it is used to save the logs, the patterns and the parsed files.
Parameters: - list_files (list) – List of the log files to read. Each element of the list is a path to a file.
- dict_patterns (dict, optional) – Patterns previously detected by the first step. If left at its default, the dict is created and the dataset computes the patterns. If provided, the dataset uses the patterns to associate each line of the file with a pattern. Defaults to {}.
- path_data (str, optional) – Path to the data. Defaults to “”.
- saving (bool, optional) – Saving the patterns to generate the embeddings. Defaults to False.
- name_dataset (str, optional) – Name of the patterns to save. Defaults to “”.
- path_model (str, optional) – Path of the folder to save the patterns. Defaults to “”.
- concat (bool, optional) – Process a chunk of files per thread instead of one file per thread. This increases performance, compensating for the poor multiprocessing performance of Python. Defaults to True.
- parser_function (function, optional) – Function to split the log entry and get the message part. Defaults to “”, which means the line is split on spaces and the words after the 9th position are used.
- sort_function (function, optional) – Function to sort the logs. Defaults to “”, which means the logs are not sorted.
- nb_files_per_chunck (int, optional) – Number of files per chunk. Defaults to 50.
- nb_cpu (int, optional) – Number of threads to use. Defaults to -1, which uses all available CPUs.
-
static execute(journal: logflow.logsparser.Journal.Journal) → logflow.logsparser.Journal.Journal
Execute the run() function of the Journal class. It is used by the multithreading implementation.
Parameters: journal (Journal) – A journal to process.
Returns: A processed journal.
Return type: Journal
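A wrapper of this kind is convenient because pool-style executors map a plain function over a list of objects. The sketch below shows the idea with the standard library's `ThreadPoolExecutor` and a stand-in `StubJournal` class. Both are assumptions made for illustration: the real Journal class does more work in run(), and logflow may use a different executor.

```python
from concurrent.futures import ThreadPoolExecutor


class StubJournal:
    """Hypothetical stand-in for Journal: run() only counts its lines."""

    def __init__(self, lines):
        self.lines = lines
        self.nb_lines = None

    def run(self):
        self.nb_lines = len(self.lines)


def execute(journal):
    """Run the journal and return it so the caller gets the processed object."""
    journal.run()
    return journal


journals = [StubJournal(["a b", "c d"]), StubJournal(["e f"])]
with ThreadPoolExecutor(max_workers=2) as pool:
    # map() preserves the order of the input list.
    processed = list(pool.map(execute, journals))
```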
-
static parser_message(line: str) → List[str]
Split the log line and return the message part.
Parameters: line (str) – The log line.
Returns: The message part of the line, represented as a list of words.
Return type: list
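The parser_function parameter of Dataset describes the default behaviour as splitting on spaces and using the words after the 9th position. A minimal sketch of that behaviour follows. It assumes that "after the 9th position" means the tenth word onward, which the documentation does not state precisely, and the sample line is invented.

```python
from typing import List


def parser_message(line: str) -> List[str]:
    """Split a log line on whitespace and drop the first nine fields."""
    # Assumption: the message starts at the tenth word (index 9).
    return line.strip().split()[9:]


# Invented log line: nine header fields followed by the message.
line = "f0 f1 f2 f3 f4 f5 f6 f7 f8 Connexion of user Marc"
message = parser_message(line)
```

A line with fewer than ten words yields an empty list under this assumption, so a custom parser_function is needed for logs with a different header layout.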
-
read_files_associating(multithreading=True, concat=True)
Read the files and associate one pattern with each line of the files.
As a first step, the function merges the list of files into a list of chunks. Each chunk contains nb_files_per_chunck files. This improves performance because pickling and unpickling data between Python processes is slow.
This function executes the run() method of the Journal class for each chunk of files.
Note that only a multithreading implementation is provided for the moment.
Parameters: - multithreading (bool, optional) – Use the multithreading implementation. Defaults to True.
- concat (bool, optional) – Use a chunk of files per thread instead of one file per thread. Defaults to True.
- nb_files_per_chunck (int, optional) – Number of files per chunk. Defaults to 50.
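The merging step described above can be sketched as a simple slicing of the file list. The function name `make_chunks` is hypothetical, and the file names are placeholders. This is only one plausible way to build the chunks.

```python
from typing import List


def make_chunks(list_files: List[str], nb_files_per_chunck: int = 50) -> List[List[str]]:
    """Split the list of files into consecutive chunks of at most nb_files_per_chunck files."""
    if nb_files_per_chunck <= 0:
        raise ValueError("nb_files_per_chunck must be positive")
    return [
        list_files[start:start + nb_files_per_chunck]
        for start in range(0, len(list_files), nb_files_per_chunck)
    ]


# Placeholder file names.
files = [f"log_{i}.txt" for i in range(120)]
chunks = make_chunks(files)
```

With 120 files and the default size, this produces three chunks, so only three objects need to be exchanged with the workers instead of 120.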
-
read_files_parsing(concat=True)
Read the files and compute the patterns.
As a first step, the function merges the list of files into a list of chunks. Each chunk contains nb_files_per_chunck files. This improves performance because pickling and unpickling data between Python processes is slow.
This function executes the run() method of the Journal class for each chunk of files.
Note that only a multithreading implementation is provided for the moment.
Parameters: - multithreading (bool, optional) – Use the multithreading implementation. Defaults to True.
- concat (bool, optional) – Use a chunk of files per thread instead of one file per thread. Defaults to True.
-
class logflow.logsparser.Embedding.Embedding(list_classes=[], loading=False, name_dataset='', path_data='', path_model='', dir_tmp='')
Compute the embedding of each pattern using the word2vec method. Here, each line is represented by the ID (integer) of its pattern.
Note that the word2vec used here is based on the C++ Google implementation. It therefore needs a file as input, and list_classes cannot be used directly for the learning step. For best performance, list_classes is written to a temporary file, which is then removed.
Parameters: - list_classes (list, optional) – List of patterns. Defaults to [].
- loading (bool, optional) – Load the list of patterns from a file. Note that you must provide list_classes if loading is False. Defaults to False.
- name_dataset (str, optional) – Name of the dataset. Used for loading it. Defaults to “”.
- path_data (str, optional) – Path to the dataset. Defaults to “”.
- path_model (str, optional) – Path to the model. Defaults to “”.
- dir_tmp (str, optional) – Path used for the temporary file. This path can be on an SSD or in RAM for better performance. Defaults to “/tmp/”.
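The temporary-file step can be sketched with the standard `tempfile` module. The function name `write_patterns_to_tmp` and the on-disk format (space-separated IDs on one line) are assumptions for illustration. The format actually expected by the word2vec implementation used by logflow is not specified here.

```python
import os
import tempfile
from typing import List


def write_patterns_to_tmp(list_classes: List[int], dir_tmp: str = "") -> str:
    """Write the pattern IDs to a temporary text file and return its path."""
    # Fall back to the system temporary directory when dir_tmp is empty.
    directory = dir_tmp or tempfile.gettempdir()
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".txt", dir=directory, delete=False
    ) as tmp_file:
        # Assumption: IDs are written as space-separated tokens.
        tmp_file.write(" ".join(str(pattern_id) for pattern_id in list_classes))
        return tmp_file.name


path = write_patterns_to_tmp([12, 7, 12, 3])
try:
    with open(path) as handle:
        content = handle.read()  # a file-based trainer would read this file
finally:
    os.remove(path)  # remove the temporary file once it is no longer needed
```

Pointing dir_tmp at a RAM-backed or SSD-backed directory only changes where this file lives, which is why it can speed up the step.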
-
static clear_list(args: Tuple[List[int], List[int]]) → list
Keep only the words from the list of vocab in the list of patterns.
Parameters: args ((list, list)) – The first argument is the list of patterns. The second is the list of vocab.
Returns: The list of patterns, keeping only the words present in the list of vocab.
Return type: list
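A minimal sketch of this filtering, assuming patterns and vocab entries are integer IDs as in the signature. Converting the vocab to a set is a choice made here for fast lookups and is not necessarily what logflow does.

```python
from typing import List, Tuple


def clear_list(args: Tuple[List[int], List[int]]) -> List[int]:
    """Keep only the pattern IDs that appear in the vocab, preserving order."""
    list_patterns, vocab = args
    vocab_set = set(vocab)  # set membership tests are O(1) on average
    return [pattern_id for pattern_id in list_patterns if pattern_id in vocab_set]


filtered = clear_list(([4, 8, 15, 8, 16], [8, 16]))
```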
-
generate_list_embeddings()
Filter the list of patterns according to the learned embeddings. The word2vec model needs a minimum number of examples per word to learn it. We remove the words excluded from the word2vec learning.
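The effect of the minimum number of examples can be illustrated without training a model. The sketch below approximates the learned vocabulary with a frequency threshold. The names `learned_vocab` and `filter_patterns` and the threshold value are hypothetical. In the real class the vocabulary comes from the trained word2vec model.

```python
from collections import Counter
from typing import List, Set


def learned_vocab(list_classes: List[int], min_count: int) -> Set[int]:
    """Return the pattern IDs that occur at least min_count times."""
    counts = Counter(list_classes)
    return {pattern_id for pattern_id, count in counts.items() if count >= min_count}


def filter_patterns(list_classes: List[int], min_count: int) -> List[int]:
    """Drop the pattern IDs that are too rare to get an embedding."""
    vocab = learned_vocab(list_classes, min_count)
    return [pattern_id for pattern_id in list_classes if pattern_id in vocab]


list_classes = [1, 1, 1, 2, 3, 3, 1, 2, 9]
filtered = filter_patterns(list_classes, min_count=2)
```

Pattern 9 appears only once, so it falls below the threshold and is removed from the list.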
-
class logflow.logsparser.Journal.Journal(parser_message, path: str, associated_pattern=False, dict_patterns={}, large_file=False, pointer=-1, encoding='latin-1', sort_function='', output='')
A journal is a list of log files. It reads and parses the logs, and associates each log with its pattern.
Parameters: - parser_message (function) – Function to split the message part of the line.
- path (str) – path to the data
- associated_pattern (bool, optional) – Associate or discover the patterns. Note that if associated_pattern is True, dict_patterns must be provided. Defaults to False.
- dict_patterns (dict, optional) – Dict of the patterns for the association. Defaults to {}.
- large_file (bool, optional) – Optimization for the reading of one large file. Not implemented yet. Defaults to False.
- pointer (int, optional) – Optimization for the reading of one large file. Not implemented yet. Defaults to -1.
- encoding (str, optional) – Encoding of the files read. Defaults to “latin-1”.
- sort_function (function, optional) – Function to sort the logs. Defaults to “”, which means the logs are not sorted.
- output (str, optional) – Set the output type. Use “logpai” to produce output usable with the benchmark provided by logpai. By default, only the ID of the log is returned.
-
associate_pattern(line: str)
Associate a line with a pattern. Add this pattern to the list of patterns.
Parameters: line (str) – line to be associated.
-
count_log(line: str)
Count the number of identical entries according to their descriptors, to save space and computation.
Example using 3 entries: “Connexion of user Marc”, “Connexion of user Marc”, “Application failure node [1,0,0,2,4]”.
Counter_logs will be: {“Connexion of user Marc”: 2, “Application failure node [1,0,0,2,4]”: 1}.
To avoid useless computation, we use a dictionary mapping each line to its descriptors, so the descriptors are not recomputed each time a line appears.
Parameters: line (str) – line of log to add to the counter.
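The counting and caching described above can be sketched as follows. The `LogCounter` class and its `_describe` method are hypothetical. In particular, the descriptor used here (word lengths) is a placeholder, since the real descriptors are computed by the Journal class.

```python
from collections import Counter


class LogCounter:
    """Hypothetical helper that counts lines and caches their descriptors."""

    def __init__(self):
        self.counter_logs = Counter()
        self.descriptors = {}
        self.nb_computations = 0

    def _describe(self, line):
        # Placeholder descriptor: the length of each word.
        self.nb_computations += 1
        return [len(word) for word in line.split()]

    def count_log(self, line):
        # Compute the descriptors only the first time a line is seen.
        if line not in self.descriptors:
            self.descriptors[line] = self._describe(line)
        self.counter_logs[line] += 1


counter = LogCounter()
for entry in [
    "Connexion of user Marc",
    "Connexion of user Marc",
    "Application failure node [1,0,0,2,4]",
]:
    counter.count_log(entry)
```

Three entries are counted, but the descriptors are computed only twice, once per distinct line.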
-
static create_vector(word: str) → str
Create the vector of descriptors associated with a word.
Parameters: word (str) – The word to describe using descriptors.
Returns: The descriptors.
Return type: str
-
filter_word(word: str) → str
Get the descriptors of the word.
Parameters: word (str) – Word to describe.
Returns: Descriptors of the word, as the string representation of a list.
Return type: str
-
static find_pattern(message: List[str], dict_patterns: dict) → logflow.logsparser.Pattern.Pattern
Find the pattern associated with a log.
The best pattern is the one that has the most words in common with the line.
Parameters: - message (List[str]) – list of the words of the message part of the log.
- dict_patterns (dict) – the dict of patterns.
Returns: the pattern associated with the line.
Return type: Pattern
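The selection rule can be sketched as a maximum over candidate patterns. Several points below are assumptions, because the documentation does not specify them: dict_patterns is taken to map a cardinality to a list of patterns, indexes are zero-based integers, and `SimplePattern` and `common_words` are hypothetical names standing in for the real Pattern class.

```python
from typing import Dict, List, NamedTuple, Optional


class SimplePattern(NamedTuple):
    """Hypothetical stand-in for the Pattern class."""
    cardinality: int
    pattern_word: List[str]
    pattern_index: List[int]


def common_words(message: List[str], pattern: SimplePattern) -> int:
    """Count the pattern words found at their expected position in the message."""
    return sum(
        1
        for word, index in zip(pattern.pattern_word, pattern.pattern_index)
        if index < len(message) and message[index] == word
    )


def find_pattern(
    message: List[str], dict_patterns: Dict[int, List[SimplePattern]]
) -> Optional[SimplePattern]:
    """Return the candidate with the most common words, or None if there is none."""
    # Assumption: only patterns with the same cardinality are candidates.
    candidates = dict_patterns.get(len(message), [])
    if not candidates:
        return None
    return max(candidates, key=lambda pattern: common_words(message, pattern))


dict_patterns = {
    4: [
        SimplePattern(4, ["Connexion", "user"], [0, 2]),
        SimplePattern(4, ["Connexion", "of", "user"], [0, 1, 2]),
        SimplePattern(4, ["Application", "failure", "node"], [0, 1, 2]),
    ]
}
best = find_pattern("Connexion of user Marc".split(), dict_patterns)
```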
-
is_number(s: str) → bool
Detect whether a string represents a float.
Parameters: s (str) – String to parse.
Returns: True if the string is a float, False otherwise.
Return type: bool
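A common way to implement such a check in Python is to attempt the conversion and catch the failure. The sketch below follows that approach. It is not necessarily identical to the logflow implementation.

```python
def is_number(s: str) -> bool:
    """Return True if the string can be converted to a float."""
    try:
        float(s)
    except ValueError:
        return False
    return True
```

Note that with this approach strings such as "1e5", "nan" and "inf" also count as floats, because `float()` accepts them.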
-
class logflow.logsparser.Parser.Parser(dataset: logflow.logsparser.Dataset.Dataset)
The parser takes a dataset and computes its patterns.
Parameters: dataset (Dataset) – dataset for computing the patterns.
-
class logflow.logsparser.Pattern.Pattern(cardinality: int, pattern_word: list, pattern_index: list)
Represents a pattern. A pattern is described by its cardinality (number of words of the associated line), its words and the indexes of these words.
Example: pattern_word = [“house”, “cat”], pattern_index = [“3”, “5”]. Here, we are looking for the word “house” at the 3rd position and the word “cat” at the 5th position.
Parameters: - cardinality (int) – Cardinality of the associated line.
- pattern_word (list) – list of the pattern’s words
- pattern_index (list) – list of the indexes of the pattern’s words
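The example above can be turned into a small matching check. This sketch uses a hypothetical `SimplePattern` dataclass rather than the real Pattern class. It assumes the indexes are zero-based integer offsets into the list of words, which the documentation does not confirm (the example shows them as strings and speaks of positions).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimplePattern:
    """Hypothetical stand-in for Pattern, with integer zero-based indexes."""
    cardinality: int
    pattern_word: List[str]
    pattern_index: List[int]

    def matches(self, message: List[str]) -> bool:
        """Check that the message has the right length and words at each index."""
        if len(message) != self.cardinality:
            return False
        return all(
            index < len(message) and message[index] == word
            for word, index in zip(self.pattern_word, self.pattern_index)
        )


pattern = SimplePattern(cardinality=7, pattern_word=["house", "cat"], pattern_index=[3, 5])
matching = pattern.matches("the big red house a cat sleeps".split())
not_matching = pattern.matches("the big red barn a cat sleeps".split())
```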