Needle in a Haystack – AI-Powered Log-Stream Processing
- Mike
- Jan 16
- 2 min read

Project Overview:
This project aimed to identify patterns across multiple log streams from interdependent applications, associate these patterns with historical incidents, and preemptively correct systems to enhance reliability and stability.
Challenges:
1. Patterns indicating issues were dispersed across numerous logs and applications.
2. Most patterns were previously unknown.
3. High noise in some logs: Up to 20% of method calls were logged as errors without affecting functionality.
4. Logs captured hundreds of thousands of calls per minute, generating hundreds of gigabytes daily.
5. Pattern timeframes varied, with some preceding incidents by minutes or seconds.
Approach:
A multi-model AI strategy was employed to detect and rank patterns by their relevance to impending problems. Both supervised and unsupervised learning techniques were utilized. AWS SageMaker was chosen to evaluate models on a subset of the data. The selected model combination was deployed via SageMaker batch inference and, for comparison, in a Python-based serverless environment to evaluate throughput and cost.
SageMaker Challenges:
1. Model-specific quirks in handling textual and numerical log data necessitated custom preprocessing.
2. Certain models were newly developed and required debugging.
3. Documentation for some models was incomplete, complicating implementation.
Model Evaluation:
Guidance was sought from experienced teams to streamline the evaluation of multiple models in parallel. In the end, the most effective approach combined:
• Neural Topic Models: To uncover topics, which were then clustered using K-means.
• Transformers: To capture topic context and patterns.
• Hidden Markov Models: To infer the internal state from contextual patterns.
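The combination above can be sketched end to end. This is an illustrative toy, not the project's code: NMF stands in for a neural topic model, the sample log lines are invented, and simple transition counts stand in for a full Hidden Markov Model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

# Invented sample log lines standing in for a real log stream.
log_lines = [
    "db connection timeout retrying",
    "db connection timeout retrying",
    "cache miss for key user-42",
    "cache miss for key user-17",
    "db connection refused",
    "queue depth exceeded threshold",
]

# 1. Topic modelling: project each log line into a low-dimensional topic
#    space (NMF here; the project used neural topic models).
tfidf = TfidfVectorizer().fit_transform(log_lines)
topics = NMF(n_components=2, init="nndsvd", random_state=0).fit_transform(tfidf)

# 2. Cluster the topic vectors so similar lines share one discrete symbol.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topics)

# 3. Feed the symbol sequence into a state model. Transition counts are a
#    stand-in for an HMM: rare transitions hint at an unusual system state.
transitions = np.zeros((2, 2))
for a, b in zip(clusters, clusters[1:]):
    transitions[a, b] += 1
row_sums = transitions.sum(axis=1, keepdims=True)
transition_probs = transitions / np.where(row_sums == 0, 1, row_sums)
print(transition_probs.round(2))
```

The key design point is the hand-off: topics turn free text into vectors, clustering turns vectors into discrete symbols, and the state model reasons over the symbol sequence in time.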
Implementation:
The most effective and cost-efficient solution ran the models as serverless containers on AWS Fargate, with Python orchestrating the processing steps applied to incoming log streams.
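The Python orchestration layer can be sketched as a chain of generator-based steps over a log stream. Everything here is hypothetical: the step names, the noise heuristic, and the constant anomaly scores are placeholders for the real parsing, filtering, and model-inference stages.

```python
from typing import Callable, Iterable, Iterator

# A step consumes a stream of records and yields a transformed stream.
Step = Callable[[Iterator[dict]], Iterator[dict]]

def parse(lines: Iterator[str]) -> Iterator[dict]:
    # Placeholder parser: tag each raw line with a coarse level.
    for line in lines:
        yield {"raw": line, "level": "ERROR" if "error" in line.lower() else "INFO"}

def drop_noise(records: Iterator[dict]) -> Iterator[dict]:
    # Up to 20% of calls were logged as errors without real impact, so a
    # noise filter runs before any model sees the data (heuristic invented).
    for rec in records:
        if "benign" not in rec["raw"]:
            yield rec

def score(records: Iterator[dict]) -> Iterator[dict]:
    # Placeholder for batch inference against the deployed model combination.
    for rec in records:
        rec["anomaly_score"] = 0.9 if rec["level"] == "ERROR" else 0.1
        yield rec

def run_pipeline(lines: Iterable[str], steps: list[Step]) -> list[dict]:
    stream: Iterator = iter(lines)
    for step in steps:
        stream = step(stream)
    return list(stream)

results = run_pipeline(
    ["error: db timeout", "benign error: retry ok", "request served"],
    [parse, drop_noise, score],
)
print(results)
```

Because each step is a generator, records flow through one at a time; the same structure works whether the container reads from a file, a queue, or a streaming source.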
Conclusion:
The project successfully identified unknown patterns indicative of future incidents. However, the vast data volume and low signal-to-noise ratio made the process slow and complex. The findings highlighted the need to:
1. Standardize log formats across applications.
2. Label errors with consistent severities.
3. Add reliable correlation keys.
4. Pre-filter and clean log data before model ingestion.
5. Tap into existing experience to accelerate model selection.
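Recommendations 1–4 above can be combined into a single normalization pass that runs before model ingestion. This is a minimal sketch under assumed conventions: the line format, field names, and severity mapping are illustrative, not the project's actual schema.

```python
import re
from typing import Optional

# Map inconsistent source severities to one consistent scale (rec. 2).
SEVERITY_MAP = {"err": "ERROR", "error": "ERROR", "warn": "WARNING", "info": "INFO"}

# Assumed raw layout: "<level> <trace-id> <message>".
LINE_RE = re.compile(r"(?P<level>\w+)\s+(?P<trace>\S+)\s+(?P<msg>.*)")

def normalize(raw: str) -> Optional[dict]:
    """Parse one raw log line into a standard record, or drop it as noise."""
    match = LINE_RE.match(raw)
    if match is None:
        return None  # unparseable lines are filtered out (rec. 4)
    level = SEVERITY_MAP.get(match["level"].lower())
    if level is None:
        return None  # unknown severities are treated as noise (rec. 4)
    return {
        "severity": level,                # consistent severity (rec. 2)
        "trace_id": match["trace"],       # reliable correlation key (rec. 3)
        "message": match["msg"].strip(),  # standardized format (rec. 1)
    }

raw_lines = [
    "ERR trace-1 db timeout",
    "garbled ????",
    "info trace-2 request served",
]
cleaned = [r for r in map(normalize, raw_lines) if r is not None]
print(cleaned)
```

Running the models only on records that survive this pass shrinks the input volume and removes exactly the kind of noise that slowed the project down.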
These improvements would better serve the business and enhance the efficiency of AI-driven log analysis: even with AI, garbage in still means garbage out.
The project has laid the groundwork for more efficient, scalable, and intelligent log analysis processes. With continued refinement and collaboration, the business is well-positioned to realize the full potential of AI-driven insights, ensuring greater stability and reliability across its systems.