Unstructured Data Protection

Data protection has never been more important, and the value of data continues to rise as it is used to train more advanced Artificial Intelligence (AI) models.

In simplistic terms, data can be categorised as structured or unstructured.

Structured Data - Organised and formatted in a predefined way (e.g., discrete data types such as numbers, short text, dates, etc.)
Unstructured Data - Lacks a predefined format and is stored in its native form (e.g., documents, images, videos, etc.)

Within a business, structured data is commonly persisted within enterprise data stores or systems of record. These are usually tightly governed, with roles, processes and tools to support stewardship, quality, privacy, compliance, security, etc.

The data within these stores or systems is usually secured at source, meaning there are tight controls on how and where the data is persisted and processed, alongside specific standards for ingestion and consumption.

Unfortunately, unstructured data is far less predictable. For example, within my business, we store approximately 240 million unstructured files and produce more than 1 million new files every month. We leverage Microsoft Office 365, so the majority of these files are Word, PowerPoint, Excel and PDF documents.

To compound this issue, we also send approximately 60,000 emails daily. These can include business-sensitive information and, through attachments, spread copies of files in an uncontrolled way.

Historically, it was common for these files to be completely uncontrolled, with no integrated protection. As a result, they were at risk of data loss, through either malicious action or accident.

Therefore, we recently implemented a new global data protection programme designed to address these risks. It incorporates a new Information Classification framework and a series of mandatory technical controls.

The high-level framework is outlined below.

The goal of the Information Classification framework is to ensure all unstructured data is appropriately classified, using a label.

The three tiers are:

Public - Data approved for public release.
Internal - Data used within the company and with trusted partners.
Restricted - Data that is sensitive and/or confidential.

The simplicity of a three-tier structure helps to balance security with productivity, making it easy for individuals to understand and apply.

The Information Classification framework is enforced via specific technical controls, facilitated by the Microsoft security ecosystem, specifically Microsoft Purview and Defender.

For example, technical controls prevent “Restricted” data from being attached and/or sent to unauthorised third parties. Instead, file sharing must occur via a controlled link and, in certain scenarios, the file will be encrypted at rest, requiring the recipient to authenticate.
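The sharing rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how Microsoft Purview or Defender is configured. The function name, the allow-list of authorised domains and the treatment of Public and Internal data as freely attachable are all assumptions made for the example.

```python
from enum import Enum


class Label(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal"
    RESTRICTED = "Restricted"


# Hypothetical allow-list; in practice the authorisation rules would be
# defined in the policy tooling, not in application code.
AUTHORISED_DOMAINS = frozenset({"example.com"})


def sharing_method(label, recipient, authorised_domains=AUTHORISED_DOMAINS):
    """Return how a file with this label may be shared with the recipient."""
    # Compare on the part after the last "@", ignoring case.
    domain = recipient.rsplit("@", 1)[-1].lower()
    if label is Label.RESTRICTED and domain not in authorised_domains:
        # Restricted data must not be attached for unauthorised parties;
        # it is shared through a controlled link instead.
        return "controlled_link"
    # Assumption: every other combination may be sent as an attachment.
    return "attachment"
```

Under these assumptions, sharing_method(Label.RESTRICTED, "jo@partner.test") returns "controlled_link", while the same file sent to an address at example.com could be attached.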

In addition, certain files can only be accessed if the request meets specific authorisation criteria, which can include the posture (security/compliance status) of the endpoint.
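As a rough sketch of a posture-aware access check, the example below combines user authorisation with the state of the endpoint. The Endpoint fields, the can_open function and the choice to apply the posture check only to “Restricted” files are hypothetical, chosen to illustrate the idea rather than to describe the actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    managed: bool    # hypothetical: device is enrolled and managed
    compliant: bool  # hypothetical: device passes compliance checks


def can_open(label, user_authorised, endpoint):
    """Decide whether a file with this label may be opened on this endpoint."""
    if not user_authorised:
        return False
    if label == "Restricted":
        # Assumption: only Restricted files also require a healthy endpoint.
        return endpoint.managed and endpoint.compliant
    return True
```

With this sketch, an authorised user on an unmanaged device could still open an “Internal” file, but not a “Restricted” one.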

By default, all emails are automatically classified (no individual action required). However, the individual can decide to reclassify, if required. All new or previously unclassified files require the individual to classify them before they can be accessed. This is a one-time task, which takes only a couple of seconds to complete.

Through this process, all newly created emails and files will be classified, with existing files being captured as they are viewed/modified or archived.
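The labelling flow can be illustrated with a small Python sketch: emails receive a default label automatically, files must be classified once before they can be accessed, and either can be reclassified by an explicit choice. The Item type, the label_on_access function and the default email label of “Internal” are assumptions made for the example; the actual default is not stated here.

```python
from dataclasses import dataclass
from typing import Optional

VALID_LABELS = ("Public", "Internal", "Restricted")
DEFAULT_EMAIL_LABEL = "Internal"  # assumption, not the documented default


@dataclass
class Item:
    kind: str  # "email" or "file"
    label: Optional[str] = None


def label_on_access(item, chosen=None):
    """Return the label that applies when an item is sent, viewed or modified."""
    if chosen is not None:
        if chosen not in VALID_LABELS:
            raise ValueError(f"unknown label: {chosen}")
        item.label = chosen  # explicit classification or reclassification
    elif item.label is None:
        if item.kind == "email":
            item.label = DEFAULT_EMAIL_LABEL  # automatic, no action required
        else:
            # Unclassified files are blocked until a label is chosen.
            raise PermissionError("file must be classified before access")
    return item.label
```

Because the label is stored on the item, the prompt is a one-time task: once a file has been classified, later calls return the existing label without asking again.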

This approach was designed to enable pragmatic scale and minimise business disruption, avoiding the need for individuals to retrospectively label all files and leveraging automation to simplify the process wherever appropriate.

A key learning from the implementation is the importance of achieving the right balance between security and productivity. If the Information Classification framework is overly complex or the controls create too much friction, the risk of business disruption may outweigh the benefits of improved security.

As a result, we invested several months running broad pilots (covering a wide range of scenarios), with a strong focus on organisational change management, specifically training and education.

The outcome dramatically improves the security posture of the business, reducing the risk associated with malicious or accidental loss of unstructured data. It also provides a robust foundation for compliance reporting and future enhancements, with all data being proactively labelled.

LifeinTECH

Matthew Bull