Under the Hood: How to Parse 500MB XML Files Without Crashing?
Discover the anatomy of the .nessus format and the engineering challenges of parsing massive XML files. Why classic scripts fail (OOM), how the event-driven approach (Pull parsers) solves the problem, and how to overcome Excel's technical limitations during reporting.
Savinien. G
4/5/2026 · 5 min read


Producing a security audit deliverable inevitably involves extracting raw data generated by scanners. For users of Tenable solutions, the preferred format is the .nessus file. Beneath this extension actually lies a standard XML file.
On paper, extracting data from an XML file is a trivial operation covered in introductory computer science courses. In the reality of a large-scale compliance audit or penetration test, parsing a .nessus file is a true engineering challenge.
This article details the anatomy of these files, the limitations of traditional extraction methods, and the software architectures required to process massive volumes of data without saturating memory (RAM), all the way to generating a high-performance Excel deliverable.
1. The Hierarchical Anatomy of the .nessus Format
To automate a Nessus export, it is essential to understand how the data is structured. The .nessus format relies on a strict hierarchy, encapsulated within a <NessusClientData_v2> root tag.
The typical structure is broken down as follows:
Report: The main container for the scan.
ReportHost: Created for each scanned machine. It contains the host's metadata via a <HostProperties> block that groups named <tag> tags (e.g., <tag name="operating-system">, <tag name="host-ip">).
ReportItem: The core element. There is one ReportItem per vulnerability or control point identified on the host.
It is inside this <ReportItem> that we find the key attributes (pluginID, severity, port, protocol) and the critical child elements for the auditor: <description>, <solution>, <cve>, or even compliance data under the cm:compliance-* namespace.
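To make this hierarchy concrete, here is a minimal Python sketch over a hand-written .nessus excerpt. The tag names and attributes match the structure described above; the host name, IP, plugin ID, and finding text are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal .nessus excerpt (illustrative values, not a real export).
SAMPLE = """<NessusClientData_v2>
  <Report name="Example Scan">
    <ReportHost name="srv-01">
      <HostProperties>
        <tag name="operating-system">Windows Server 2022</tag>
        <tag name="host-ip">10.0.0.5</tag>
      </HostProperties>
      <ReportItem pluginID="10180" severity="3" port="445" protocol="tcp">
        <description>Example finding description.</description>
        <solution>Apply the vendor patch.</solution>
        <cve>CVE-2024-0001</cve>
      </ReportItem>
    </ReportHost>
  </Report>
</NessusClientData_v2>"""

root = ET.fromstring(SAMPLE)
host = root.find("Report/ReportHost")

# HostProperties is a flat list of named <tag> elements: turn it into a dict.
props = {t.get("name"): t.text for t in host.find("HostProperties")}
item = host.find("ReportItem")

print(props["host-ip"])                            # 10.0.0.5
print(item.get("pluginID"), item.get("severity"))  # 10180 3
```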
The Problem of Scale: While this tree structure is readable, it quickly becomes gargantuan. An enterprise scan covering thousands of hosts, with thousands of findings per host, generates an XML file that can easily exceed 500 MB, or even reach a gigabyte.
2. The Memory Trap: DOM vs. Streaming
It is when facing these massive files that "home-made" scripts (usually in Python or PowerShell) show their limitations. The majority of developers default to the DOM (Document Object Model) approach for parsing XML.
The DOM Approach: The Out Of Memory (OOM) Error
DOM parsing loads the entire XML document into RAM to build an object tree. While this method allows for easy navigation through the tree structure, it is fatal for performance. An XML file parsed in DOM typically consumes a volume of RAM representing 5 to 10 times the initial file size, depending on the implementation (meaning several gigabytes for a 500 MB file). This is the primary cause of crashes due to memory saturation.
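In Python terms, the DOM approach is what `xml.etree.ElementTree.parse` does: the entire object tree is materialised before a single value can be read. A hedged sketch (an in-memory buffer stands in for a file; a real scan would be opened as `ET.parse("scan.nessus")`):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a .nessus file on disk.
data = io.BytesIO(
    b"<NessusClientData_v2><Report>"
    b"<ReportHost name='srv-01'>"
    b"<ReportItem pluginID='10180' severity='3'/>"
    b"</ReportHost>"
    b"</Report></NessusClientData_v2>"
)

# DOM-style parsing: the whole object tree is built in RAM first.
# Harmless here; at 500 MB of input, this is what triggers the OOM crash.
tree = ET.parse(data)
root = tree.getroot()

# Every ReportItem of every host is now resident in memory simultaneously.
findings = root.findall(".//ReportItem")
print(len(findings))  # 1
```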
The Event-Driven Approach: Constant Consumption
To read a large XML file without crashing, software engineering practice dictates switching to event-driven parsing. Historically embodied by SAX, this approach is implemented more elegantly today via pull parsers (event iterators such as XmlReader in .NET).
Instead of building a tree, the parser reads the file sequentially, like a stream. It reacts to every opening tag, text read, and closing tag. The memory footprint thus remains perfectly constant, whether the file weighs 10 MB or 2 GB. Information is read, extracted if necessary, and then immediately purged from memory.
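In Python, the stdlib equivalent of this streaming model is `ElementTree.iterparse`: elements are handed to you one at a time as their closing tags arrive, and can be purged immediately. A minimal sketch (an in-memory buffer stands in for the file; the plugin IDs are invented):

```python
import io
import xml.etree.ElementTree as ET

# Simulated .nessus stream; a real file would be opened with open("scan.nessus", "rb").
data = io.BytesIO(
    b"<NessusClientData_v2><Report>"
    b"<ReportHost name='srv-01'>"
    b"<ReportItem pluginID='10180' severity='3'><cve>CVE-2024-0001</cve></ReportItem>"
    b"<ReportItem pluginID='20007' severity='1'/>"
    b"</ReportHost>"
    b"</Report></NessusClientData_v2>"
)

plugin_ids = []
# iterparse yields each element as its closing tag is read;
# the whole tree is never required in RAM at once.
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "ReportItem":
        plugin_ids.append(elem.get("pluginID"))
        elem.clear()  # drop the finding's children and text once extracted

print(plugin_ids)  # ['10180', '20007']
```

This is the "read, extract, purge" cycle described above: memory usage depends on the size of one finding, not on the size of the file.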
3. Intelligent Extraction: State Machines and Edge Cases
Event-driven parsing poses a new challenge: since the tree is not stored in memory, the program loses the notion of "context". When it reads a <description> tag, it needs to know if it is located in the global metadata or within a specific vulnerability.
The State Machine
The technical solution relies on implementing a finite state machine. The process transitions dynamically as it reads:
Initial State → Waiting for the root.
Report State → Waiting for a ReportHost.
Host State → Waiting for a ReportItem.
Finding State → Capturing specific data.
This approach allows for selective capture. The parser ignores the content of irrelevant tags, thus avoiding the buffering of useless text, which drastically speeds up execution.
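The state machine above can be sketched directly on top of a pull parser. This is a simplified illustration (Python's `iterparse` with both start and end events; the state names mirror the list above, and the sample data is invented):

```python
import io
import xml.etree.ElementTree as ET

data = io.BytesIO(
    b"<NessusClientData_v2><Report>"
    b"<ReportHost name='srv-01'>"
    b"<ReportItem pluginID='10180' severity='3'>"
    b"<description>Finding on the host.</description>"
    b"</ReportItem>"
    b"</ReportHost>"
    b"</Report></NessusClientData_v2>"
)

state = "initial"            # initial -> report -> host -> finding
current_host = None
current_item = None
findings = []

for event, elem in ET.iterparse(data, events=("start", "end")):
    if event == "start":
        if elem.tag == "Report":
            state = "report"
        elif elem.tag == "ReportHost":
            state, current_host = "host", elem.get("name")
        elif elem.tag == "ReportItem":
            state = "finding"
            current_item = {"host": current_host,
                            "pluginID": elem.get("pluginID"),
                            "severity": elem.get("severity")}
    else:  # "end" events
        if state == "finding" and elem.tag == "description":
            # Captured only because the state says we are inside a ReportItem;
            # a <description> elsewhere in the file would be ignored.
            current_item["description"] = elem.text
        elif elem.tag == "ReportItem":
            findings.append(current_item)
            state = "host"   # transition back up one level
            elem.clear()
        elif elem.tag == "ReportHost":
            state = "report"

print(findings[0]["host"], findings[0]["pluginID"])  # srv-01 10180
```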
Managing the Complexity and Subtleties of the Format
Converting a raw Nessus file also involves handling very specific irregularities on the fly:
Detecting Mixed Scans: A single file can contain both classic vulnerabilities and compliance audits (presence of cm:compliance-* tags). The state machine must be able to dynamically identify them to route them to the correct data structures.
Intelligent Compliance Extraction: Instead of exporting the raw control name (e.g., "18.2.2 Ensure 'Do not allow password expiration'..."), the parser applies regular expressions to cleanly isolate the normative reference ("18.2.2") from the description, thus facilitating readability in Excel.
CDATA Sections: CDATA blocks are massively used in Nessus to include raw text containing reserved XML characters (like <, >, &) without having to escape them. The parser must process and concatenate them properly.
Multiple Elements: A single ReportItem can contain multiple <cve> tags. The code must aggregate them instead of overwriting the value with the last iteration read.
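Two of these subtleties lend themselves to short sketches. The regular expression below is a hypothetical example of splitting a normative reference off a CIS-style check name (the exact pattern a real tool uses may differ), and the second snippet shows CVE aggregation with `findall` instead of a last-write-wins assignment:

```python
import re
import xml.etree.ElementTree as ET

# --- Isolating the normative reference from a compliance check name ---
# Hypothetical pattern: capture a leading dotted number such as "18.2.2".
check_name = "18.2.2 Ensure 'Do not allow password expiration' is set"
m = re.match(r"^(\d+(?:\.\d+)*)\s+(.*)", check_name)
reference, label = m.group(1), m.group(2)
print(reference)  # 18.2.2

# --- Aggregating repeated <cve> elements instead of overwriting ---
item = ET.fromstring(
    "<ReportItem pluginID='99999'>"
    "<cve>CVE-2024-0001</cve><cve>CVE-2024-0002</cve>"
    "</ReportItem>"
)
cves = [c.text for c in item.findall("cve")]
print(", ".join(cves))  # CVE-2024-0001, CVE-2024-0002
```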
4. The Reporting Wall: Parsing is Good, Formatting is Another Story
This is where many conversion tools hit a wall. Having an engine capable of parsing several gigabytes of XML in a reasonable time is a technical feat, but it is wasted if the final deliverable is an afterthought.
The goal is not to spit out raw data, but to generate a clean Excel report. And that is where Microsoft Excel itself becomes the bottleneck.
Technically, an Excel worksheet is limited to 1,048,576 rows. In reality, long before reaching this physical limit, injecting hundreds of thousands of rows containing heavy blocks of text (vulnerability descriptions and solutions) will simply freeze or crash Excel when the client opens the file.
For a vulnerability export to be actionable, simply converting it to CSV is not enough. You must:
Intelligently aggregate and deduplicate results.
Generate interactive dashboards (distribution by severity, Top 10).
Inject dynamic formulas so that charts update if the auditor requalifies a false positive.
Apply professional conditional styling.
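The first two bullets can be sketched with nothing but stdlib collections. The rows, plugin names, and severity codes below are hypothetical stand-ins for what the parser would emit; a real tool would also write the result into a styled .xlsx rather than print it:

```python
from collections import Counter

# Hypothetical flat rows as the parsing stage would emit them:
# (host, pluginID, plugin name, severity 0-4)
rows = [
    ("srv-01", "10180", "SMB Signing not required", 2),
    ("srv-02", "10180", "SMB Signing not required", 2),
    ("srv-01", "20007", "SSL Version 2 detected", 3),
    ("srv-01", "10180", "SMB Signing not required", 2),  # duplicate from a second scan
]

# Deduplicate identical (host, plugin) pairs left by overlapping scans.
unique = sorted(set(rows))

# Severity distribution, feeding a dashboard chart.
by_severity = Counter(sev for *_rest, sev in unique)

# "Top N" findings ranked by number of affected hosts.
by_plugin = Counter((pid, name, sev) for _host, pid, name, sev in unique)
top = by_plugin.most_common(10)

print(len(unique))     # 3
print(by_severity[2])  # 2
print(top[0][0][1])    # SMB Signing not required
```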
5. From Theory to Practice: The Hybrid Architecture of a Dedicated Utility
Developing an event-driven parsing pipeline, merging data, and formatting complex .xlsx files natively requires an engineering effort that strays from the core business of a pentester. Maintaining home-grown scripts to generate beautiful dynamic Excel files quickly proves to be a dependency nightmare.
This is where a specialized desktop application like NtE (Nessus To Excel) makes perfect sense. Its design is based on a strict architectural choice: the separation of concerns via a hybrid stack.
An Isolated Interface (Flutter): The UI is not only fluid, but the architecture relies on isolates (separate threads). Heavy parsing runs in the background, ensuring that the interface never freezes, even when ingesting a 2 GB file.
Parallel Processing and Deduplication: The tool allows for drag-and-dropping multiple .nessus files simultaneously. They are parsed in parallel, and duplicated hosts appearing in multiple files are automatically merged to provide a consolidated view to the client.
The Processing Engine (C# / .NET 8): This is the core of the engine. The switch to a strongly typed and compiled language like C# was dictated by an unavoidable technical necessity: access to an ecosystem of extremely powerful Enterprise-grade libraries. These libraries make it possible to build and manipulate .xlsx files natively, inject dynamic formulas, and generate dashboards without needing to instantiate invisible COM processes, and without even requiring Excel to be installed on the generation machine.
Ultimately, while parsing a Nessus file seems simple on the surface, doing it resiliently without saturating RAM, consolidating multiple scans, and transforming this mass of information into a dynamic Excel deliverable more than justifies abandoning artisanal scripts in favor of dedicated engineering tools.