[OC] # Network Structure Analysis: Detecting Anomalies in Redacted Public Records

(New to Reddit as contributor, have edited multiple times to fit format etc… if unacceptable will delete. I’d just be glad if 1 or 2 people read it.)

Method: Graph-theoretic analysis, cosine similarity profiling
Verification: MD5 c4baa2bb518d806b27ef649ac9cf1f46 | 19,977 edges | Fully reproducible
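
For anyone who wants to check the hash above, here is a minimal sketch of computing an MD5 digest of a local file with Python's standard library. The file name `edges.csv` is a placeholder; the post does not say which file the hash was computed over.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "c4baa2bb518d806b27ef649ac9cf1f46"  # hash quoted in the post

# Hypothetical usage; "edges.csv" is an assumed file name:
# print(md5_of_file("edges.csv") == EXPECTED)
```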

The Problem

When governments release documents with redactions, readers tend to assume that anonymized entries are either privacy protection or collapsed "junk buckets" (many unrelated records lumped under one placeholder). These assumptions are rarely tested mathematically.

The Method

If two anonymized nodes are both "junk buckets," they should have similar connection profiles (cosine similarity close to 1). If one is a real, distinct entity, its profile should be nearly orthogonal to theirs (cosine similarity close to 0).
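
To make "connection profile" concrete: one plausible reading is a mapping from each neighbor to the number of edges shared with it. The sketch below builds such profiles from a toy edge list; the tuple layout and node names are illustrative assumptions, not the post's actual data schema.

```python
from collections import Counter, defaultdict

def build_profiles(edges):
    """Map each node to a Counter of {neighbor: edge count}.

    `edges` is an iterable of (source, target) pairs; the graph is
    treated as undirected, which is an assumption of this sketch.
    """
    profiles = defaultdict(Counter)
    for src, dst in edges:
        profiles[src][dst] += 1
        profiles[dst][src] += 1
    return profiles

# Toy edge list with made-up node names.
toy_edges = [("X", "p"), ("X", "q"), ("Y", "p"), ("Y", "q"), ("Z", "r")]
profiles = build_profiles(toy_edges)
print(dict(profiles["X"]))  # {'p': 1, 'q': 1}
```

In this toy graph X and Y share all their neighbors, so their profiles point in the same direction; Z shares none, so its profile is orthogonal to both.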

The Finding

(The image posted with this thread shows the values below ×100 for visual clarity.)

Comparison	Cosine Similarity	Interpretation
Node A vs B	0.0184	Near-orthogonal – distinct profiles
Node A vs C	0.364	Low overlap
Node B vs C	0.9025	Highly similar – same artifact class

B and C: 0.9025 similarity = collapsed artifacts (same source)
A vs B: 0.0184 similarity = NOT an artifact (distinct entity)

Protection Asymmetry

Node	Connections	Visibility
Jeffrey Epstein	5,625	Visible
NODE A	3,923	HIDDEN
Donald Trump	1,685	Visible
Ehud Barak	701	Visible
NODE B	481	HIDDEN
NODE C	160	HIDDEN

Node A has 5.6x as many connections as Barak (3,923 vs 701) but is COMPLETELY anonymized.
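
The ratios behind this comparison can be checked directly from the connection counts in the table above:

```python
# Connection counts copied from the table in the post.
connections = {
    "Jeffrey Epstein": 5625,
    "NODE A": 3923,
    "Donald Trump": 1685,
    "Ehud Barak": 701,
    "NODE B": 481,
    "NODE C": 160,
}

def ratio(a, b):
    """How many times as many connections `a` has as `b`."""
    return connections[a] / connections[b]

print(round(ratio("NODE A", "Ehud Barak"), 1))    # 5.6
print(round(ratio("NODE A", "Donald Trump"), 1))  # 2.3
```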

Edge Type Distribution

Type	Node A	Node B	Node C
Unknown	53.3%	71.0%	52.5%
Business	6.9%	1.2%	6.8%
Legal	5.8%	3.4%	5.8%
Communication	3.95%	4.78%	3.75%

Node A shows structured distribution = real entity
Node B shows 71% unknown = noise bucket
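
A distribution like the one above can be computed by counting edge types among a node's edges. This sketch assumes each edge is a (source, target, type) triple; that layout and the toy data are hypothetical.

```python
from collections import Counter

def edge_type_distribution(edges, node):
    """Percentage of each edge type among edges touching `node`."""
    counts = Counter(etype for src, dst, etype in edges if node in (src, dst))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: 100.0 * n / total for etype, n in counts.items()}

toy_edges = [
    ("N", "a", "Unknown"),
    ("N", "b", "Unknown"),
    ("c", "N", "Business"),
    ("N", "d", "Legal"),
    ("x", "y", "Legal"),  # does not touch N, so it is ignored
]
print(edge_type_distribution(toy_edges, "N"))
# {'Unknown': 50.0, 'Business': 25.0, 'Legal': 25.0}
```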

The Logic

QUESTION: Is Node A just another collapsed data bucket?

TEST: If A is a bucket like B and C, their profiles should match.

EVIDENCE:
– B vs C = 0.9025 cosine = SAME BUCKET
– A vs B = 0.0184 cosine = DIFFERENT ENTITY

CONCLUSION: Node A is statistically distinct. Not an artifact.

Reproduction

```python
import numpy as np

def cosine_similarity(p1, p2):
    # Align both profiles on the union of their keys; missing keys count as 0.
    all_keys = set(p1.keys()) | set(p2.keys())
    v1 = np.array([p1.get(k, 0) for k in all_keys])
    v2 = np.array([p2.get(k, 0) for k in all_keys])
    # Dot product divided by the product of the vector norms.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```

Results:

A vs B: 0.0184 (distinct)

B vs C: 0.9025 (same bucket)
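
A quick sanity check of the function on toy profiles (made-up neighbor names and counts), with one defensive addition that is not in the post: returning 0.0 when either profile is empty, instead of dividing by zero.

```python
import numpy as np

def cosine_similarity(p1, p2):
    """Cosine similarity between two {neighbor: weight} profiles."""
    all_keys = sorted(set(p1) | set(p2))
    v1 = np.array([p1.get(k, 0) for k in all_keys], dtype=float)
    v2 = np.array([p2.get(k, 0) for k in all_keys], dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    # Guard (an addition for this sketch): empty profiles have zero norm.
    return float(np.dot(v1, v2) / denom) if denom else 0.0

same = cosine_similarity({"p": 2, "q": 1}, {"p": 4, "q": 2})  # parallel profiles
disjoint = cosine_similarity({"p": 3}, {"r": 5})              # no shared neighbors
print(round(same, 4), round(disjoint, 4))  # 1.0 0.0
```

Profiles that differ only in scale score 1, and profiles with no shared neighbors score 0, which is the behavior the bucket test relies on.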

What This Shows

Node A is statistically real, not a data artifact
Node A receives disproportionate protection relative to connectivity
The methodology is fully reproducible

What This Does NOT Show

Identity of any individual
Guilt or wrongdoing
Classified information

Hash: c4baa2bb518d806b27ef649ac9cf1f46

Structure, anomaly, methodology—not accusation.

Posted by Old_Iron986

2 comments

Old_Iron986 says:

2025-11-30 at 11:57

Data comes from the House Oversight Committee’s Epstein document release, processed through the maxandrews/Epstein-doc-explorer GitHub repo and converted to CSV for analysis. Visuals were generated in Python using matplotlib/pandas/numpy. Full reproducible code is included in the post.
zakats says:

2025-11-30 at 12:29

This might be better served in /r/science or /r/datascience.
