Office 365/Exchange email deduplication problem

PrimoGeek · January 29, 2020, 2:37am

We are acquiring Office 365 email from multiple clients. Many emails carry what I believe is the X.500 version of the email address (the ones that have the “/ou=…/cn=” strings). For example, the copy in the sender’s Sent folder has the X.500 address, but the received copy in another user’s inbox has the regular email address. This makes it impossible to de-dupe messages that are identical except for the X.500 vs. regular email address difference. I assume there is some default setting that isn’t being changed. I also assume that once the email is sent you can’t “fix” the problem by converting the X.500 addresses to regular email addresses. I verified the same issue using either FEC or the Office 365 eDiscovery module.

This is really affecting our ability to de-dupe and thread messages for eDiscovery purposes. We have been trying to work with the message ID, but that doesn’t help with threading. Has anyone else seen this behavior? Any suggestions?

agungor · January 29, 2020, 6:12pm

Good point, Sean. It is common to run into X.500 and IMCEA-encapsulated addresses when dealing with Exchange. For the details, look at how the LegacyExchangeDN attribute is used in Exchange routing.

It is often not viable to perform any normalization during forensic preservation, as you would be modifying the original evidence. So de-duplication and threading are issues to be handled by your downstream tools during eDiscovery. A few thoughts on how these could be accomplished:

De-duplication

In some cases, you will find that the contact associated with a message has both its X.500 address and SMTP address available—in the PR_EMAIL_ADDRESS_W and PR_SMTP_ADDRESS_W MAPI properties respectively. When this is the case, the processing tool can (perhaps optionally) favor the SMTP address for email hash calculation to normalize the email addresses.

When only the X.500 address is available, my suggestion would be to feed the tool a mapping of X.500 addresses to SMTP addresses to facilitate the normalization. I am not aware of any off-the-shelf eDiscovery tools that do this at the moment.
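To make the two cases above concrete, here is a minimal Python sketch of the address resolution step. It assumes the processing tool has already extracted the PR_EMAIL_ADDRESS_W and PR_SMTP_ADDRESS_W values as strings. The function names (`is_x500`, `resolve_smtp`) and the mapping format (lower-cased X.500 address to SMTP address) are hypothetical, not part of any existing tool.

```python
from typing import Mapping, Optional


def is_x500(address: str) -> bool:
    """Heuristic (an assumption): Exchange legacy DNs start with "/o="."""
    return address.strip().lower().startswith("/o=")


def resolve_smtp(
    email_address: Optional[str],   # value of PR_EMAIL_ADDRESS_W, if present
    smtp_address: Optional[str],    # value of PR_SMTP_ADDRESS_W, if present
    x500_to_smtp: Mapping[str, str],  # hypothetical mapping, keys lower-cased
) -> str:
    """Return the address to feed into the email hash calculation."""
    # Case 1: the message itself carries the SMTP address, so favor it.
    if smtp_address and smtp_address.strip():
        return smtp_address.strip().lower()

    addr = (email_address or "").strip()

    # Case 2: only the X.500 address is present; try the supplied mapping.
    if is_x500(addr):
        mapped = x500_to_smtp.get(addr.lower())
        if mapped:
            return mapped.strip().lower()

    # Fallback: keep whatever we have, lower-cased for stable comparison.
    return addr.lower()
```

The original item is never modified; the resolved address is only used as an input to the hash, which keeps the normalization out of the preservation step.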

Threading

The conversation index (PR_CONVERSATION_INDEX MAPI property) is a good choice for threading Exchange emails. This not only identifies the header message, but also gives you the origination date of the header message, the position of your message in the thread, and time differences among the messages.
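As a rough illustration, the sketch below splits a raw conversation index into a thread key and a position in the thread. It assumes the commonly documented layout: a 22-byte header (1 reserved byte, 5 timestamp bytes, 16-byte GUID) followed by one 5-byte block per reply or forward. It does not decode the timestamps, and the names `ThreadPosition` and `parse_conversation_index` are hypothetical. Indexes written by non-Exchange clients may not follow this layout, so treat the result as a hint rather than ground truth.

```python
from typing import NamedTuple, Optional

HEADER_LEN = 22  # assumed: 1 reserved byte + 5 timestamp bytes + 16 GUID bytes
CHILD_LEN = 5    # assumed: one block appended per reply/forward


class ThreadPosition(NamedTuple):
    thread_key: str                 # hex of the GUID shared by the whole thread
    depth: int                      # 0 for the header (root) message
    parent_index: Optional[bytes]   # conversation index of the parent, if any


def parse_conversation_index(raw: bytes) -> ThreadPosition:
    """Split a PR_CONVERSATION_INDEX value into thread key and position."""
    extra = len(raw) - HEADER_LEN
    if extra < 0 or extra % CHILD_LEN != 0:
        raise ValueError(f"unexpected conversation index length: {len(raw)}")

    guid = raw[6:HEADER_LEN]        # bytes 6..21 hold the thread GUID
    depth = extra // CHILD_LEN      # number of child blocks = reply depth

    # A reply's index is its parent's index plus one child block, so
    # dropping the last block yields the parent's index.
    parent = raw[:-CHILD_LEN] if depth > 0 else None
    return ThreadPosition(guid.hex(), depth, parent)
```

Grouping messages by `thread_key` and linking each one to the message whose index equals its `parent_index` reconstructs the thread tree without relying on sender or recipient addresses, so the X.500 vs. SMTP mismatch does not affect it.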

On a related note, Gmail / G Suite messages have a somewhat similar threadId attribute. FEC captures this during acquisitions and includes it in its output (Downloaded Items log). You could use threadId to thread Google emails much as you would use the conversation index to thread Exchange emails.

I would be very curious to hear the experiences of others in this area as well!

skeeble · February 11, 2020, 8:45pm

We run into this problem often. Most of the time it is when dealing with an email server and an archiving platform. For eDiscovery purposes, we will create a custom metadata hash for our clients. This can be used to propagate coding in the event all unique messages need to be produced, or to identify a unique set of messages to be reviewed.

agungor · February 11, 2020, 9:39pm

Let’s say you have two copies of a message with the following from lines:

From: /O=EXCHANGELABS/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=RECIPIENTS/CN=CA1E61A4432642EAB7BCD42E6E2C613B-JDOE
From: jdoe@example.com

When you create your custom hash, do you translate #1 to #2 so the two items can have the same custom hash? Another option I can think of might be to include only the participants’ names in your custom hash—along with things like subject, sent date, etc.—rather than their email addresses, but it sounds like this could increase false positives.
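For illustration, the first option (translating #1 to #2 before hashing) could look something like the Python sketch below. The field selection, the separator, the choice of SHA-256, and the `custom_hash` name are all assumptions made for the example, not a description of how any particular tool or team builds its hash.

```python
import hashlib
from typing import Iterable, Mapping


def custom_hash(
    sender: str,
    recipients: Iterable[str],
    subject: str,
    sent_utc: str,                    # e.g. an ISO 8601 string, already in UTC
    x500_to_smtp: Mapping[str, str],  # hypothetical mapping, keys lower-cased
) -> str:
    """Hash selected metadata after translating X.500 addresses to SMTP."""

    def norm(addr: str) -> str:
        key = addr.strip().lower()
        # Translate when the mapping knows the address, else keep it as-is.
        return x500_to_smtp.get(key, key).lower()

    parts = [
        norm(sender),
        ";".join(sorted(norm(r) for r in recipients)),  # order-independent
        subject.strip(),
        sent_utc,
    ]
    # Unit separator keeps field boundaries unambiguous in the hashed string.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

With a mapping entry for the X.500 address in the first From line, both copies of the message produce the same hash. Hashing participant names instead would only require swapping the `norm` inputs, but two different people sharing a display name would then collide.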