Ensemble classifier - Matlab implementation

Description

Matlab implementation of the ensemble classifier as described in [1]. The first use of an ensemble in steganalysis (though not yet fully automated) appeared in [2].

There is no need to install anything; you can start using the function ensemble.m right away.

The usage of the program under different experimental setups is demonstrated in the attached example files; all needed feature files are also included. We highly recommend going through these examples, as they show how the program should be used in steganalysis experiments. Additional information can be found in the F.A.Q. section below.

The program is available for public use. Please remember to acknowledge our work by citing [1].

Thank you.

Paper abstract

Today, the most accurate steganalysis methods for digital media are built as supervised classifiers on feature vectors extracted from the media. The tool of choice for the machine learning seems to be the support vector machine (SVM). In this paper, we propose an alternative and well known machine learning tool – ensemble classifiers – and argue that they are ideally suited for steganalysis. Ensemble classifiers scale much more favorably w.r.t. the number of training examples and the feature dimensionality with performance comparable to the much more complex SVMs. The significantly lower training complexity opens up the possibility for the steganalyst to work with rich (high-dimensional) cover models and train on larger training sets – two key elements that appear necessary to reliably detect modern steganographic algorithms. Ensemble classification is portrayed here as a powerful developer tool that allows fast construction of steganography detectors with markedly improved detection accuracy across a wide range of embedding methods. The power of the proposed framework is demonstrated on two steganographic methods that hide messages in JPEG images.

Contact

  • Jan Kodovský - jan (dot) kodovsky (at) binghamton (dot) edu
  • Jessica Fridrich - fridrich (at) binghamton (dot) edu
  • Vojtěch Holub - vholub1 (at) binghamton (dot) edu

Download

  • Version 2.0 (September 2013) NEW
       - download: ensemble_2.0.zip (13 MB)
       - includes the training function ensemble_training.m, the testing function ensemble_testing.m, and a
         simple tutorial tutorial.m; sample features included
       - The purpose of version 2.0 is to simplify everything as much as possible. The main modifications
         compared to the first version of the ensemble classifier:
            • Instead of a single routine, training is now separated from testing. This allows for more
              flexibility in usage (for example, for studying the effects of the cover-source mismatch).
            • Training outputs the data structure 'trained_ensemble', which allows the trained classifier to
              be stored easily.
            • The ensemble no longer accepts paths to features. Instead, it requires the features directly
              (Xc - cover features, Xs - stego features). Xc and Xs must have the same dimension and must
              contain synchronized cover/stego pairs - see the attached tutorial for more details.
            • There is no output into a log file, so there is no hard-drive access at all now.
            • Since the training and testing routines were separated, our ensemble implementation no longer
              takes care of training/testing divisions. This is now the responsibility of the user. Again,
              see the attached tutorial for examples.
            • Bagging is now always on.
            • We fixed the fclose bug (Error: too many files open).
            • The covariance caching option was removed.
            • Added the settings.verbose = 2 option (screen output of only the last row).
            • The ensemble now works even if the full dimensionality equals 1 or 2. If it equals 1, multiple
              decisions are still combined, as different base learners are trained on different bootstrap
              samples (bagging).

  • Version 1.0:
       - download: ensemble_1.0.zip (75 MB)
       - includes the main file ensemble.m, 5 introductory examples, and the sample feature files (BOSSbase,
         HUGO algorithm, CC-PEV and SPAM features)
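Version 2.0 deliberately leaves the training/testing division to the user while requiring synchronized cover/stego pairs. One way to do such a split, kept here as a Python/numpy sketch rather than the toolbox's Matlab (all names are illustrative), is to permute the image indices once and slice both feature matrices with the same index sets:

```python
import numpy as np

def split_pairs(Xc, Xs, train_frac=0.5, seed=0):
    """Random training/testing division that keeps cover/stego pairs together:
    the same image indices slice both feature matrices, so a cover image and
    its stego version always land on the same side of the split."""
    assert Xc.shape == Xs.shape, "cover/stego features must be synchronized"
    perm = np.random.default_rng(seed).permutation(Xc.shape[0])
    n_train = int(round(train_frac * Xc.shape[0]))
    tr, te = perm[:n_train], perm[n_train:]
    return Xc[tr], Xs[tr], Xc[te], Xs[te]
```

Splitting cover and stego features independently would let a cover image end up in training while its stego version lands in testing, which biases the measured error; using one permutation for both matrices avoids that.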

References

[1] J. Kodovský, J. Fridrich, and V. Holub, Ensemble Classifiers for Steganalysis of Digital Media. IEEE Transactions on Information Forensics and Security, Vol. 7, No. 2, pp. 432-444, April 2012. [pdf]

[2] J. Kodovský, and J. Fridrich, Steganalysis in high dimensions: fusing classifiers built on random subspaces. Proc. SPIE, Electronic Imaging, Media Watermarking, Security, and Forensics XIII, San Francisco, CA, January 23–26, 2011. [pdf] [slides]


F.A.Q.

Q: Do I need any additional packages, libraries or Matlab toolboxes?

A: No.


Q: What is the format of features used by the ensemble implementation?

A: Conveniently, we use Matlab's *.mat files. Every feature file must contain two variables: F and names. The variable F is a data matrix containing features row by row, i.e. the number of rows corresponds to the number of samples and the number of columns to the feature space dimensionality. The variable names is a cell array whose length equals the height of the matrix F; it contains the filenames of the images from which the features were extracted. See the included tutorial for more details.
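For illustration, a file in this format can also be written and read from Python. This is a sketch assuming scipy is available; the filenames and the temporary path are placeholders:

```python
import os
import tempfile
import numpy as np
from scipy.io import loadmat, savemat

# Build a tiny feature file in the described format: F is a
# num_samples x dimensionality matrix, names is a cell array of filenames.
F = np.arange(15, dtype=float).reshape(3, 5)
names = np.array(['img1.pgm', 'img2.pgm', 'img3.pgm'], dtype=object)
path = os.path.join(tempfile.gettempdir(), 'features_demo.mat')
savemat(path, {'F': F, 'names': names})

# Read it back; loadmat returns the cell array as a nested object array.
data = loadmat(path)
F_loaded = data['F']                                        # (3, 5) matrix
names_loaded = [str(cell[0]) for cell in data['names'][0]]  # cell -> list of str
assert F_loaded.shape[0] == len(names_loaded)
```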


Q: I created a useful extension and would like to contribute it and make it public.

A: Send us your extension together with a description and a well-commented example.


Q: Is the ensemble really as accurate as SVMs?

A: According to our experiments with features and stego algorithms in both the spatial and JPEG domains, the ensemble is in general as accurate as a linear SVM (or slightly better). Compared with a Gaussian SVM, there may be a slight drop in performance if the decision boundary is more complicated (non-linear).


Q: What are the main advantages of the ensemble over SVMs and other machine learning?

A: Speed. Period. The ensemble is more scalable w.r.t. the training set size and the feature space dimensionality - its complexity scales better with these two parameters (see [1]).



Last update: September 2013