
Feature Extraction FAQ

Feature extraction is an optional capability that strengthens verification and reuse checks for images, text, and audio — especially when the content was edited or re-encoded.

What is “feature extraction”?

Feature extraction turns content into compact, comparable signals (for example: perceptual hashes, histograms, keypoints, texture descriptors, and audio fingerprints). Unlike a plain cryptographic hash (which changes after any edit), these signals can remain comparable across common transformations such as resizing, compression, minor edits, cropping, or re-encoding.
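To make the contrast concrete, here is a minimal pure-Python sketch (illustrative only, not the service's actual implementation): a cryptographic hash of the raw pixels changes completely after a mild brightness edit, while a toy average-hash fingerprint does not.

```python
import hashlib

# Toy 4x4 grayscale "image" (brightness values 0-255).
img = [
    [200, 200, 50, 50],
    [200, 200, 50, 50],
    [30, 30, 220, 220],
    [30, 30, 220, 220],
]
# The same image after a mild, uniform brightness edit (+10).
edited = [[min(255, p + 10) for p in row] for row in img]

def average_hash(image):
    """Perceptual-style hash: one bit per pixel, set if above the mean."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)

def sha256_hex(image):
    """Cryptographic hash of the raw pixel bytes."""
    return hashlib.sha256(bytes(p for row in image for p in row)).hexdigest()

# The cryptographic hashes diverge completely after the edit...
print(sha256_hex(img) == sha256_hex(edited))      # False
# ...but the perceptual fingerprint is unchanged.
print(average_hash(img) == average_hash(edited))  # True
```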

How does it relate to a proof record, timestamp, and C2PA?

  • Proof record + timestamp answers: “this specific fingerprint existed at time X.”
  • C2PA export supports provenance workflows (credentials/manifest tied to the content workflow).
  • Extracted features help answer: “how similar is this content to the original, even after edits?”

Together, these elements form a stronger evidence chain for delivery disputes, plagiarism claims, or reuse investigations: time anchor + provenance artifact + similarity signals.

What do you extract from images?

Image extraction is organized into layers: fast by default, going deeper only when needed.

Coarse: fast, broad filters for grouping and quick similarity signals.

  • Perceptual hashing (pHash) — robust “looks similar” fingerprint under resizing/compression.
  • Histogram analysis — rough comparison of color/brightness distribution.
  • Edge detection (Sobel / Canny) — structural outlines that still work when colors change.
  • Adaptive thresholding (Gaussian / Otsu) — shape-like structure signals.
  • Statistical features — quick descriptive characteristics useful for screening.
  • Fourier spectrum — frequency-domain signature that can highlight repeated patterns.
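Two of the coarse signals above can be sketched in a few lines of pure Python. This is a simplified stand-in (real pHash thresholds DCT coefficients rather than raw block averages, and images here are toy 2D lists of brightness values):

```python
def average_hash(image, size=8):
    """Downsample to size x size by block averaging, then threshold at the
    mean. A simplified stand-in for pHash."""
    h, w = len(image), len(image[0])
    bh, bw = h // size, w // size
    blocks = []
    for by in range(size):
        for bx in range(size):
            vals = [image[by * bh + y][bx * bw + x]
                    for y in range(bh) for x in range(bw)]
            blocks.append(sum(vals) / len(vals))
    mean = sum(blocks) / len(blocks)
    return [1 if b > mean else 0 for b in blocks]

def hamming(h1, h2):
    """Number of differing bits; a small distance suggests 'looks similar'."""
    return sum(a != b for a, b in zip(h1, h2))

def histogram_intersection(img_a, img_b, bins=16):
    """Overlap of brightness histograms (1.0 = identical distributions)."""
    def hist(img):
        counts = [0] * bins
        flat = [p for row in img for p in row]
        for p in flat:
            counts[min(p * bins // 256, bins - 1)] += 1
        return [c / len(flat) for c in counts]
    return sum(min(a, b) for a, b in zip(hist(img_a), hist(img_b)))

# A 16x16 gradient and a slightly brightened copy hash identically.
gradient = [[x * 16 for x in range(16)] for _ in range(16)]
brightened = [[min(255, p + 8) for p in row] for row in gradient]
print(hamming(average_hash(gradient), average_hash(brightened)))  # 0
```

A Hamming distance of zero (or near zero) across these coarse signals is what lets re-encoded or brightness-shifted copies be grouped cheaply before any deeper analysis.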

Intermediate: more precise matching for edited/cropped content.

  • SIFT — local keypoints/descriptors for similarity under crop/scale/rotation.
  • ORB — keypoint-based method optimized for speed and practical matching.
  • PCA reduction — compresses descriptors to reduce storage/compute while keeping match power.
  • SURF — not enabled in free mode (licensing constraints).
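In practice the descriptors come from a library such as OpenCV (SIFT/ORB); the sketch below shows only the matching step, using Lowe's ratio test on small hypothetical descriptor vectors, which is the standard way to keep only unambiguous keypoint matches:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """For each descriptor in desc_a, keep its nearest neighbour in desc_b
    only if it is clearly closer than the second-nearest neighbour
    (Lowe's ratio test). Returns (index_in_a, index_in_b) pairs."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((euclidean(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

# Hypothetical 2-D descriptors: two clear matches, one ambiguous point.
desc_a = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
desc_b = [[0.2, 0.1], [9.8, 10.1], [50.0, 50.0]]
print(ratio_test_matches(desc_a, desc_b))  # [(0, 0), (1, 1)]
```

The ambiguous descriptor `[5.0, 5.0]` is roughly equidistant from two candidates, so the ratio test discards it; the fraction of keypoints that survive this filter is a common crop/rotation-robust similarity signal.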

Fine: deeper signals when you need more confidence.

  • Texture analysis — texture descriptors for subtle edit/synthesis patterns.
  • GBM-based scoring — can combine multiple signals into a stronger similarity score.

What do you extract from text?

  • Normalization — reduces noise to make comparisons more stable.
  • Letter frequency histogram — lightweight signature for rough similarity checks.
  • Histogram hash — compact digest for efficient storage/comparison.

This is intentionally conservative and fast. If you need stronger text similarity for disputes, we can propose advanced options (e.g., n-grams/shingles, structure-aware checks, semantic methods).
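The three text steps can be sketched as follows. The exact normalization rules used by the service are an assumption here; this version keeps lowercase letters only:

```python
import hashlib
from collections import Counter

def normalize(text):
    """Lowercase and keep letters only: a simple stand-in for the
    normalization step (the service's exact rules may differ)."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def letter_histogram(text):
    """Relative frequency of each letter: a lightweight similarity signature."""
    letters = normalize(text)
    counts = Counter(letters)
    total = len(letters) or 1
    return {ch: counts[ch] / total for ch in sorted(counts)}

def histogram_hash(text, precision=3):
    """Compact digest of the rounded histogram for cheap storage/comparison."""
    hist = letter_histogram(text)
    canon = ",".join(f"{ch}:{freq:.{precision}f}" for ch, freq in hist.items())
    return hashlib.sha256(canon.encode()).hexdigest()

# Case and punctuation changes no longer affect the signature.
print(histogram_hash("Hello, World!") == histogram_hash("hello world"))  # True
```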

What do you extract from audio?

  • MFCC histogram — compact fingerprint of timbre used widely in audio similarity.
  • Spectrogram — time–frequency representation that supports deeper analysis.
  • Chromagram — pitch-class distribution useful for music-like similarity patterns.

Audio may be normalized to WAV internally to make extraction stable across formats (e.g., m4a → wav).
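A spectrogram is just a magnitude spectrum computed per overlapping frame. Real pipelines use an FFT library (e.g., NumPy or librosa); this pure-Python sketch uses a naive DFT to keep the idea visible:

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Magnitude spectrum per overlapping frame (naive DFT; real pipelines
    use an FFT). Rows = frames, columns = frequency bins."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # Hann window reduces spectral leakage at the frame edges.
        windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
                    for n, s in enumerate(frame)]
        bins = []
        for k in range(frame_size // 2):  # keep non-negative frequencies only
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(windowed))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

# A 256-sample test tone with exactly 8 cycles per 64-sample frame,
# so its energy should land in DFT bin 8.
spec = spectrogram([math.sin(2 * math.pi * 8 * n / 64) for n in range(256)])
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin)  # 8
```

MFCCs and chromagrams are further reductions of this same time–frequency representation (mel filterbank plus DCT for MFCCs, pitch-class folding for chromagrams).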

Can this help in plagiarism or reuse disputes?

It can provide strong technical foundations, especially when the reused content was modified. A typical pattern is: the exact file hash no longer matches after edits, but robust features still indicate high similarity under defined methods. In practice, the most reliable approach is multi-signal evidence (not a single metric).

  • Timestamped proof supports “I had this content (fingerprint) at time X”.
  • C2PA supports provenance and publishing traceability.
  • Feature signals support “this appears derived/similar despite transformations”.

Is it guaranteed to prove copying?

No single method universally “proves copying” in every case. Similarity methods show measurable closeness under defined metrics. For higher-stakes cases, we recommend combining multiple independent signals and producing a structured, repeatable report.
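One simple way to combine independent signals is a weighted average of scores that are each normalized to [0, 1]. The signal names and weights below are purely illustrative, not calibrated values from the service:

```python
def combined_similarity(signals, weights=None):
    """Weighted average of independent similarity signals, each already
    normalized to [0, 1]. Weights here are illustrative, not calibrated."""
    if weights is None:
        weights = {name: 1.0 for name in signals}
    total = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total

# Hypothetical per-signal scores for a disputed image pair.
evidence = {
    "phash_similarity": 0.93,    # e.g., 1 - normalized Hamming distance
    "histogram_overlap": 0.88,
    "keypoint_match_rate": 0.71,
}
score = combined_similarity(evidence, {"phash_similarity": 2.0,
                                       "histogram_overlap": 1.0,
                                       "keypoint_match_rate": 2.0})
print(round(score, 3))  # 0.832
```

For a dispute-grade report, the individual scores should be preserved alongside the combined one, so each signal can be re-derived and challenged independently.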

What’s available today vs. expanded options?

  • Available: feature extraction (image/text/audio) as described above, tied to your verification workflow.
  • On request: higher-precision configurations, stronger reporting, dispute-grade packages, and higher-volume automation.

Need a stronger evidence configuration?

If you expect disputes, tell us your content type, volume, and what “reuse” looks like in your domain. We’ll propose the right mix of proofs, retention, exports, and similarity signals.

Contact us