[OC] # Network Structure Analysis: Detecting Anomalies in Redacted Public Records

(New to Reddit as a contributor; I have edited this several times to fit the format. If it is unacceptable I will delete it. I’d just be glad if one or two people read it.)

Method: Graph-theoretic analysis, cosine similarity profiling
Verification: MD5 c4baa2bb518d806b27ef649ac9cf1f46 | 19,977 edges | Fully reproducible


The Problem

When governments release documents with redactions, the public assumes anonymized entries are either privacy protection or collapsed "junk buckets." These assumptions are rarely tested mathematically.


The Method

If two anonymized nodes are both "junk buckets," they should have similar connection profiles (cosine similarity close to 1). If one is a real, distinct entity, its profile should be nearly orthogonal to theirs (cosine similarity close to 0).
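
To make "connection profile" concrete: one reasonable reading is a mapping from each neighbor to the number of edges shared with it. The sketch below builds such profiles from an edge list. The `(source, target)` tuple layout, the undirected treatment, and the name `build_profiles` are assumptions for illustration, not necessarily how the original analysis was done.

```python
from collections import Counter, defaultdict

def build_profiles(edges):
    """Map each node to a Counter of {neighbor: edge count}.

    `edges` is an iterable of (source, target) pairs. The graph is
    treated as undirected, which is an assumption of this sketch.
    """
    profiles = defaultdict(Counter)
    for source, target in edges:
        profiles[source][target] += 1
        profiles[target][source] += 1
    return profiles

# Toy edge list (hypothetical data, not taken from the release).
toy_edges = [("A", "x"), ("A", "y"), ("B", "z"), ("C", "z"), ("C", "z")]
toy_profiles = build_profiles(toy_edges)
```

In this toy graph, B and C both connect only to `z`, so their profiles point in the same direction, while A shares no neighbors with either.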


The Finding

[SEE IMAGE ABOVE] (Image shows values ×100 for visual clarity)

| Comparison | Cosine Similarity | Interpretation |
|---|---|---|
| Node A vs B | 0.0184 | NEAR-ORTHOGONAL – distinct profiles |
| Node A vs C | 0.364 | Low overlap |
| Node B vs C | 0.9025 | HIGHLY SIMILAR – same artifact class |

B and C: 0.9025 similarity = collapsed artifacts (same source)
A vs B: 0.0184 similarity = NOT an artifact (distinct entity)


Protection Asymmetry

| Node | Connections | Visibility |
|---|---|---|
| Jeffrey Epstein | 5,625 | Visible |
| NODE A | 3,923 | HIDDEN |
| Donald Trump | 1,685 | Visible |
| Ehud Barak | 701 | Visible |
| NODE B | 481 | HIDDEN |
| NODE C | 160 | HIDDEN |

Node A has about 5.6x as many connections as Barak (3,923 vs 701) but is COMPLETELY anonymized.
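
The ratio is simple arithmetic on the connection counts listed above, and it can be checked in a few lines. The counts are copied from the post; the helper name `ratio` is just for illustration.

```python
# Connection counts as listed in the post.
connections = {
    "Jeffrey Epstein": 5625,
    "NODE A": 3923,
    "Donald Trump": 1685,
    "Ehud Barak": 701,
    "NODE B": 481,
    "NODE C": 160,
}

def ratio(a, b):
    """How many times as many connections `a` has as `b`."""
    return connections[a] / connections[b]

# Compare NODE A against each named (visible) node.
for name in ("Ehud Barak", "Donald Trump", "Jeffrey Epstein"):
    print(f"NODE A vs {name}: {ratio('NODE A', name):.2f}x")
```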


Edge Type Distribution

| Type | Node A | Node B | Node C |
|---|---|---|---|
| Unknown | 53.3% | 71.0% | 52.5% |
| Business | 6.9% | 1.2% | 6.8% |
| Legal | 5.8% | 3.4% | 5.8% |
| Communication | 3.95% | 4.78% | 3.75% |

Node A shows structured distribution = real entity
Node B shows 71% unknown = noise bucket
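
For anyone wanting to recompute percentages like the ones in the table, here is a minimal sketch. It assumes each edge carries a single type label; the function name `edge_type_distribution` and the toy labels are hypothetical.

```python
from collections import Counter

def edge_type_distribution(edge_types):
    """Return {edge type: percentage of edges}, given one label per edge."""
    counts = Counter(edge_types)
    total = sum(counts.values())
    if total == 0:
        return {}  # a node with no edges has no distribution
    return {label: 100.0 * n / total for label, n in counts.items()}

# Hypothetical labels for one node's edges (not real data).
toy_labels = ["Unknown", "Unknown", "Business", "Legal"]
toy_dist = edge_type_distribution(toy_labels)
```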


The Logic

QUESTION: Is Node A just another collapsed data bucket?

TEST: If A is a bucket like B and C, their profiles should match.

EVIDENCE:
– B vs C = 0.9025 cosine = SAME BUCKET
– A vs B = 0.0184 cosine = DIFFERENT ENTITY

CONCLUSION: Node A is statistically distinct. Not an artifact.


Reproduction

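```python
import numpy as np

def cosine_similarity(p1, p2):
    """Cosine similarity of two connection profiles given as dicts."""
    # Align both profiles on the union of their keys; missing keys count as 0.
    all_keys = set(p1.keys()) | set(p2.keys())
    v1 = np.array([p1.get(k, 0) for k in all_keys])
    v2 = np.array([p2.get(k, 0) for k in all_keys])
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Results reported in the post:
#   A vs B: 0.0184 (distinct)
#   B vs C: 0.9025 (same bucket)
```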
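
A quick sanity check of the measure on toy inputs may help when reproducing this. The variant below is a dependency-free sketch (standard library only) that also returns 0.0 for an empty profile instead of dividing by zero. The name `cosine_similarity_pure` and the toy dicts are made up for illustration.

```python
import math

def cosine_similarity_pure(p1, p2):
    """Cosine similarity of two dict profiles, without numpy."""
    keys = set(p1) | set(p2)
    dot = sum(p1.get(k, 0) * p2.get(k, 0) for k in keys)
    norm1 = math.sqrt(sum(v * v for v in p1.values()))
    norm2 = math.sqrt(sum(v * v for v in p2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty profile has no direction to compare
    return dot / (norm1 * norm2)

# No shared neighbors: the profiles are orthogonal.
sim_disjoint = cosine_similarity_pure({"x": 3, "y": 1}, {"z": 5})

# Same proportions at a different scale: similarity is (numerically) 1.
sim_scaled = cosine_similarity_pure({"x": 1, "y": 2}, {"x": 10, "y": 20})
```

Cosine similarity ignores overall magnitude, so a node with 481 connections and one with 160 can still score close to 1 if their connections are spread the same way.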


What This Shows

  • Node A is statistically real, not a data artifact
  • Node A receives disproportionate protection relative to connectivity
  • The methodology is fully reproducible

What This Does NOT Show

  • Identity of any individual
  • Guilt or wrongdoing
  • Classified information

Hash: c4baa2bb518d806b27ef649ac9cf1f46
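
To check a local copy against this hash, something like the following should work. It assumes the MD5 refers to the dataset file used for the analysis, which the post does not state explicitly; the function names are illustrative.

```python
import hashlib

# MD5 value quoted in the post.
EXPECTED_MD5 = "c4baa2bb518d806b27ef649ac9cf1f46"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_posted_hash(path):
    """True if the file's MD5 equals the value quoted in the post."""
    return md5_of_file(path) == EXPECTED_MD5
```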

Structure, anomaly, methodology—not accusation.

Posted by Old_Iron986

2 comments
  1. Data comes from the House Oversight Committee’s Epstein document release, processed through the maxandrews/Epstein-doc-explorer GitHub repo and converted to CSV for analysis. Visuals were generated in Python using matplotlib/pandas/numpy. Full reproducible code is included in the post.

  2. This might be better served in /r/science or /r/datascience.
