Getting Started¶
The first section describes how to get a dataset. The second section shows the typical way to use LogFlow. The third section gives a complete example.
Note that more complete explanations can be found in the Cookbook section.
Data¶
You can find data to test LogFlow on the LogHub repository: https://github.com/logpai/loghub
The following example is based on the Windows dataset.
Note that LogFlow is optimized to handle several small files rather than one large file. We can therefore split the Windows dataset into several small files, for example with the split command on Linux.
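Where the split command is not available, the same splitting can be sketched in Python. This is only an illustration, not part of LogFlow: the function name split_log_file, the default chunk size and the numeric suffixes (x00000, x00001, ...) are assumptions. GNU split uses alphabetic suffixes (xaa, xab, ...) by default, but both schemes produce names that begin with "x".

```python
from pathlib import Path


def split_log_file(source, out_dir, lines_per_file=100_000):
    """Split `source` into files of at most `lines_per_file` lines.

    Hypothetical helper (not a LogFlow function). Chunks are written to
    `out_dir` as x00000, x00001, ... and their paths are returned in order.
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    chunk = None
    try:
        with open(source, "r", encoding="utf-8", errors="replace") as src:
            for index, line in enumerate(src):
                # Open a new chunk every `lines_per_file` lines.
                if index % lines_per_file == 0:
                    if chunk is not None:
                        chunk.close()
                    path = out_dir / f"x{len(created):05d}"
                    chunk = open(path, "w", encoding="utf-8")
                    created.append(path)
                chunk.write(line)
    finally:
        if chunk is not None:
            chunk.close()
    return created
```

Calling split_log_file("data/Windows/Windows.log", "data/Windows/") would then play the same role as running split -l 100000 inside data/Windows/.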
Workflow¶
For each new dataset, we always follow the same three-step process:
- Parse the logs
- Learn the correlations
- Show the correlations using a correlations tree
- Parse the logs
This first step is split into four parts. Beforehand, you need to define a function “parser_function” that extracts the message part of a log line (see the “First Example” section).
The first part is to read the data and to generate the associated hashmap used by LogFlow. Assuming you have a “list_files” variable containing the list of files you want to process, this part is done by:
dataset = Dataset(list_files=list_files, parser_function=parser_function) # Generate your data
Next, we can detect the patterns using the dataset we have just built.
patterns = Parser(dataset).detect_pattern() # Detect patterns
Then, we can build the dataset using the previously computed patterns. Note that this step writes to disk the dataset used later by the learning step. This dataset contains the computed patterns and the list of logs, each log being replaced by its pattern id.
Dataset(list_files=list_files, dict_patterns=patterns, saving=True, path_data="data/", name_dataset="Windows_test", path_model="model/", parser_function=parser_function) # Apply the detected patterns to the data
The last part is to turn each pattern id into a numerical vector.
Embedding(loading=True, name_dataset="Windows_test", path_data="data/", path_model="model/").start() # Generate embedding for the LSTM
- Learn the correlations
Now, we can use the LSTM-based model to learn the correlations between our logs.
The first part is to create the dataset per cardinality. As a reminder, the learning step uses one LSTM model per cardinality to handle the highly imbalanced dataset.
list_cardinalities = Dataset_learning(path_model="model/", path_data="data/", name_dataset="Windows_test").run() # Create your dataset
Now, we can build our models. Note that you can choose which cardinalities are learned by setting the cardinalities_choosen parameter.
worker = Worker(cardinalities_choosen=[4,5,6,7], list_cardinalities=list_cardinalities, path_model="model/", name_dataset="Windows_test") # Create the worker
The last part is to start the training. The models are saved automatically. Training stops when a stopping condition is met: the increase of the macro-F1 value stays below 0.01 for 3 consecutive steps.
worker.train() # Start learning the correlations
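The stopping condition can be illustrated with a small, self-contained sketch. It is one possible reading of the rule described above, not LogFlow's actual implementation: the function name should_stop and the parameters min_delta and patience are hypothetical, and the history is assumed to hold one macro-F1 value per evaluation step.

```python
def should_stop(macro_f1_history, min_delta=0.01, patience=3):
    """Return True when each of the last `patience` steps improved
    the macro-F1 value by less than `min_delta`.

    Hypothetical helper: an assumed interpretation of the stopping rule.
    """
    # We need `patience` successive differences, hence `patience + 1` values.
    if len(macro_f1_history) <= patience:
        return False
    recent = macro_f1_history[-(patience + 1):]
    gains = [after - before for before, after in zip(recent, recent[1:])]
    return all(gain < min_delta for gain in gains)


# The last three gains are 0.005, 0.003 and 0.001, so training would stop.
print(should_stop([0.50, 0.60, 0.605, 0.608, 0.609]))
# One of the last three gains is 0.1, so training would continue.
print(should_stop([0.50, 0.60, 0.605, 0.608]))
```

With this reading, a decrease of the macro-F1 value also counts as a step without sufficient improvement.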
- Show the correlations tree.
We can use the previously learned model to show the correlations between our logs.
Again, we create a dataset, this time containing our learned model and the patterns discovered during the first step.
dataset = Dataset_building(path_model="model/", name_model="Windows_test", path_data="data/Windows/Windows.log", parser_function=parser_function) # Build your dataset
We load the model files and the logs.
dataset.load_files() # Load the model
dataset.load_logs() # Load the logs
We create our workflow (a workflow chains the log parser, the embedding step and the model inference in order to process raw logs).
workflow = Workflow(dataset) # Build your workflow
Then, we get the tree!
workflow.get_tree(index_line=24712) # Get the tree of the line at index 24712
First Example¶
A complete example is given here. It is based on the main.py file provided at the root of the repository. Following the three-step process, the example is split into three main parts: the log parser, the model and the tree builder.
- LogFlow import
Start by importing LogFlow, along with listdir, which is used below to list the log files.
from os import listdir
from logflow.logsparser.Dataset import Dataset
from logflow.logsparser.Parser import Parser
from logflow.logsparser.Embedding import Embedding
from logflow.logsparser.Journal import Journal
from logflow.relationsdiscover.Dataset import Dataset as Dataset_learning
from logflow.relationsdiscover.Worker import Worker
from logflow.logsparser import Pattern
from logflow.relationsdiscover import Model
from logflow.treebuilding.Dataset import Dataset as Dataset_building
from logflow.treebuilding.Workflow import Workflow
- Define functions (optional)
We need to define a function that extracts the message part of one log entry. If this function is not provided, the default behavior is to split the log entry on the space character and keep only the words from the 9th one onward (included). For the Windows dataset, the message is everything after the first four words, which is what the slice [4:] selects.
def parser_function(line):
    return line.strip().split()[4:]
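To see what this function returns, it can be applied to a single line. The sample line below is an illustrative entry written in the layout of the LogHub Windows dataset (date, time, level, component, then the message); it is an assumption made for this sketch, and other datasets will need a different slice.

```python
def parser_function(line):
    # Same rule as above: drop the first four whitespace-separated words.
    return line.strip().split()[4:]


# Illustrative sample line (assumed layout: date, time, level, component, message).
sample = "2016-09-28 04:30:30, Info                  CBS    Loaded Servicing Stack v6.1.7601.23505"

print(parser_function(sample))
# ['Loaded', 'Servicing', 'Stack', 'v6.1.7601.23505']

# A line with fewer than five words yields an empty message.
print(parser_function("incomplete line"))
# []
```

Note that the function returns the message as a list of words, not as a single string.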
If we want to sort the logs according to a field, we can also define a sort function. For example, using the Windows dataset, we can sort the logs by node. Note that the logs are sorted per file; LogFlow does not sort the logs again per thread. This is an experimental feature, and it is better to sort the logs before starting LogFlow.
def split_function(line):
    try:
        return line.strip().split()[3]
    except IndexError: # The line has fewer than four words
        return "1"
def sort_function(list_lines):
    return sorted(list_lines, key=split_function)
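The effect of these two functions can be checked on a handful of lines. The sample lines below are made up for illustration and only mimic the Windows layout; the fourth whitespace-separated word is used as the sort key, and lines that are too short fall back to the key "1".

```python
def split_function(line):
    try:
        return line.strip().split()[3]
    except IndexError:  # The line has fewer than four words
        return "1"


def sort_function(list_lines):
    return sorted(list_lines, key=split_function)


# Made-up sample lines (assumption): the 4th word is the sort key.
lines = [
    "2016-09-28 04:30:31, Info CSI first CSI entry",
    "2016-09-28 04:30:32, Info CBS first CBS entry",
    "2016-09-28 04:30:33, Info CSI second CSI entry",
    "truncated",
]

for line in sort_function(lines):
    print(line)
# truncated
# 2016-09-28 04:30:32, Info CBS first CBS entry
# 2016-09-28 04:30:31, Info CSI first CSI entry
# 2016-09-28 04:30:33, Info CSI second CSI entry
```

Because Python's sorted is stable, lines sharing the same key keep their original relative order, so the chronological order inside each group is preserved.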
- LogParser
We can start the first module. The first step is to create a dataset. Then, the parser is used to detect the patterns. A new dataset is created using the previously discovered patterns, and word2vec embeddings are computed from this new dataset.
path_logs = "data/Windows/"
list_files = []
for file in listdir(path_logs):
    if file.startswith("x"): # Using the split command, each small file name begins with an "x"
        list_files.append(path_logs + file)
dataset = Dataset(list_files=list_files, parser_function=parser_function) # Generate your data
patterns = Parser(dataset).detect_pattern() # Detect patterns
Dataset(list_files=list_files, dict_patterns=patterns, saving=True, path_data="data/", name_dataset="Windows_test", path_model="model/", parser_function=parser_function, sort_function=sort_function) # Apply the detected patterns to the data
Embedding(loading=True, name_dataset="Windows_test", path_data="data/", path_model="model/").start() # Generate embedding for the LSTM
- Model
We can learn the correlations based on the previous embeddings. We can set a size to use only part of the data, for example 1,000,000 lines, which can speed up the learning process.
size = 1000000
list_cardinalities = Dataset_learning(path_model="model/", path_data="data/", name_dataset="Windows_test", size=size).run() # Create your dataset
worker = Worker(cardinalities_choosen=[4,5,6,7], list_cardinalities=list_cardinalities, path_model="model/", name_dataset="Windows_test") # Create the worker
worker.train() # Start learning the correlations
- Tree builder
Everything is ready: we can now get the tree representing the correlations.
dataset = Dataset_building(path_model="model/", name_model="Windows_test", path_data="data/Windows/Windows.log", parser_function=parser_function) # Build your dataset
dataset.load_files() # Load the model
dataset.load_logs() # Load the logs
workflow = Workflow(dataset) # Build your workflow
workflow.get_tree(index_line=24712) # Get the tree of the line at index 24712
- Get the results (optional)
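To evaluate our model, we can merge the results obtained for the different cardinalities. Note that the Results class is not among the imports listed at the beginning of this example and has to be imported as well.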
results = Results(path_model="model/", name_model="Windows_test")
results.load_files()
results.compute_results(condition="Test")
results.print_results()