Here is a great tip from @PintAndClick: you can pipe the output of sigtool --find-sigs into sigtool --decode-sigs to get a nice breakdown of the signatures:

Searching through VirusTotal Intelligence, I found a couple of .iso files (CD & DVD images) containing a malicious EXE spammed via email like this one. Here is the attached .iso file (from May 25th 2017) on VirusTotal, with name “REQUEST FOR QUOTATION,DOC.iso”.
Recent versions of Windows will open ISO files like a folder, and give you access to the contained files.
I found Python library isoparser to help me analyze .iso files.
Here is how I use it interactively to look into the ISO file. I create an iso object from an .iso file, and then I list the children of the root object:
The root folder contains one file: DIALOG42.EXE.
Looking into the content of file DIALOG42.EXE, I see the header is MZ (very likely a PE file):
And I can also retrieve all the content to calculate the MD5 hash:
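A minimal Python sketch of these two checks, applied to a byte string that has already been read from the ISO image. The helper names looks_like_pe and md5_hex are hypothetical, and the sample bytes are made up for illustration:

```python
import hashlib


def looks_like_pe(data: bytes) -> bool:
    """Return True when the data starts with the MZ magic of DOS/PE executables."""
    return data[:2] == b"MZ"


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of the data as a lowercase hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Made-up stand-in for the content of the first file in the ISO image.
sample = b"MZ\x90\x00" + b"\x00" * 60
print(looks_like_pe(sample), md5_hex(sample))
```

An MZ header is only a hint: a tool like pecheck.py is still needed to confirm that the data is a valid PE file.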
This is a quick & dirty Python script to dump the first file in an ISO image to stdout:
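import os
import sys
import isoparser
oIsoparser = isoparser.parse(sys.argv[1])
if sys.platform == 'win32':
    # Put stdout in binary mode, so that Windows does not translate newlines
    import msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
# Write raw bytes: sys.stdout.buffer on Python 3, sys.stdout itself on Python 2
getattr(sys.stdout, 'buffer', sys.stdout).write(oIsoparser.root.children[0].content)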
This allows me to pipe the content into other programs, like pecheck.py:
An .iso file downloaded from the Internet (thus with a Zone.Identifier ADS) opened in Windows 10 will not propagate this “mark-of-the-web” to the contained files.
Here is an example with file demo.iso, marked as downloaded from the Internet:
When this file is opened (double-clicked), it is mounted as a drive (E: in this example), and we see the content (a Word document: demo.docx):
This file is not marked as downloaded from the Internet:
Word does not open it in Protected View:
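The mark-of-the-web is stored as a small INI-style text in the Zone.Identifier alternate data stream. Here is a minimal sketch that parses such text with Python. The function name parse_zone_identifier is hypothetical, and reading the stream itself (for example by opening demo.iso:Zone.Identifier on NTFS) is assumed to happen elsewhere:

```python
import configparser


def parse_zone_identifier(text: str):
    """Return the ZoneId found in Zone.Identifier text, or None when absent."""
    # Interpolation is disabled because other values (like URLs) can contain '%'.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if parser.has_option("ZoneTransfer", "ZoneId"):
        return parser.getint("ZoneTransfer", "ZoneId")
    return None


# Typical content for a file downloaded from the Internet (zone 3).
example = "[ZoneTransfer]\nZoneId=3\n"
print(parse_zone_identifier(example))
```

A result of None corresponds to the situation of demo.docx above: there is no mark, so Word has no reason to use Protected View.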
You probably know that I like to pipe commands together when I analyze malware …
Are you familiar with Windows’ clip command? It’s a very simple command that I use often: it reads input from stdin and copies it to the Windows clipboard.
Here is an example where I use it to copy all the VBA code of a malicious Word document to the clipboard, so that I can paste it into a text editor without having to write it to disk.
I was asked if malware authors can abuse autorun.inf files in .ISO files: no, nothing will execute automatically when you open an .ISO file with autorun.inf file in Windows 8 or 10.
I have videos to illustrate this:
This is how I deploy and configure ClamAV on Windows:
I download the portable Windows x64 version in a ZIP file (clamav-0.99.2-x64.zip).
I extract the content of this ZIP file to folder c:\portable\; this creates a subfolder ClamAV-x64 containing ClamAV.
Then I copy the 2 samples for the config files:
copy c:\portable\ClamAV-x64\conf_examples\clamd.conf.sample c:\portable\ClamAV-x64\clamd.conf
copy c:\portable\ClamAV-x64\conf_examples\freshclam.conf.sample c:\portable\ClamAV-x64\freshclam.conf
I create a database folder (to contain the signature files):
mkdir c:\portable\ClamAV-x64\database
I edit file c:\portable\ClamAV-x64\freshclam.conf:
Line 8: #example
Line 13: DatabaseDirectory c:\portable\ClamAV-x64\database
Now I can run freshclam.exe to download the latest signatures:
Then I edit file c:\portable\ClamAV-x64\clamd.conf:
Line 8: #example
Line 74: DatabaseDirectory c:\portable\ClamAV-x64\database
And now I can run clamscan.exe to scan a sample:
@futex90 shared a sample with me detected by many anti-virus programs on VirusTotal but, according to oledump.py, without VBA macros:
I’ve seen this once before: this is a malicious document that has been cleaned by an anti-virus program. The macros have been disabled by orphaning the streams that contain them. This is similar to deleting a file from a filesystem: the index entry is removed, but the content remains. FYI: olevba will find macros.
Using the raw option, it’s possible to extract the macros:
I was able to find the original malicious document: f52ea8f238e57e49bfae304bd656ad98 (this sample was analyzed by Talos).
The anti-virus that cleaned this file changed just 13 bytes in total to orphan the macro streams and change the storage names:
This can be clearly seen using oledir:
I sometimes retrieve malware over Tor, just as a simple trick to use another IP address than my own. I don’t do anything particular to be anonymous, just use Tor in its default configuration.
On Linux, it’s easy: I install the tor and torsocks packages, then start tor, and use wget or curl with torsocks, like this:
torsocks wget URL
torsocks curl URL
On Windows, it’s a bit more difficult, because the torsocks trick doesn’t work.
I run Tor (Windows Expert Bundle) without any configuration:
This will give me a Socks listener, that curl can use:
curl --socks5-hostname 127.0.0.1:9050 http://www.didierstevens.com
Option --socks5-hostname makes curl use the Socks listener provided by Tor to make connections and perform DNS requests (option --socks5 does not use the Socks listener for DNS requests, just for connections).
wget has no option to use a Socks listener, but it can use an HTTP(S) proxy.
Privoxy is a filtering proxy that I can use to help wget to talk to Tor like this.
I make 2 changes to Privoxy’s configuration config.txt:
1) I change line 811 from “toggle 1” to “toggle 0” to configure Privoxy as a normal proxy, without filtering.
2) I add this line 1363: “forward-socks5t / 127.0.0.1:9050 .”, this makes Privoxy use Tor.
Then I launch Privoxy:
And then I can use wget like this:
wget -e use_proxy=yes -e http_proxy=127.0.0.1:8118 -e https_proxy=127.0.0.1:8118 URL
Port 8118 is Privoxy’s port. If you want, you can also put these options in a configuration file.
Often, my wget command will be a bit more complex (I’ll explain this in another blog post, but it’s based on this ISC diary entry):
wget -d -o 01.log -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" -e use_proxy=yes -e http_proxy=127.0.0.1:8118 -e https_proxy=127.0.0.1:8118 --no-check-certificate URL
I can also use Tor Browser instead of Tor, but then I need to connect to port 9150.
With most of my tools, I try to support input via STDIN.
It’s also possible to provide JavaScript scripts for parsing to SpiderMonkey via STDIN. You can pass filename - to js for processing STDIN input:
I often store malware in password-protected ZIP files; these files can be analyzed too, provided you use zipdump.py:
And with option -e, it’s also possible to change output type via the command line:
I added a new option (-I, --ignorehex) to base64dump.py to make the extraction of the PE file inside a JScript script generated with DotNetToJScript a bit easier.
DotNetToJScript is James Forshaw‘s “tool to generate a JScript which bootstraps an arbitrary .NET Assembly and class”.
Here is an example of a script generated by James’ tool:
The serialized .NET object is embedded as a string concatenation of BASE64 strings, assigned to variable serialized_obj.
With re-search.py, I extract all strings from the script (e.g. strings delimited by double quotes):
The first 3 strings are not part of the BASE64 encoded object, hence I get rid of them (there are no unwanted strings at the end):
And now I have BASE64 characters; I just have to get rid of the double quotes and the newlines (base64dump searches for continuous strings of BASE64 characters). With base64dump’s -w option I can get rid of whitespace (including newlines), and with option -i I can get rid of the double-quote character. Unfortunately, escaping this character (\") works on Windows, but then cmd.exe gets confused by the next pipe (it expects a closing double quote). That’s why I introduced option -I, to specify characters by their hexadecimal value. Double-quote is 0x22, thus I use option -I 22:
This is the serialized object, and it contains the .NET assembly I want to analyze. .NET assemblies are .DLLs, i.e. PE files. With my YARA rule to detect PE files, I can find it inside the serialized data:
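The clean-up step can be sketched in Python as follows. This is not how base64dump.py is implemented; the function name decode_quoted_base64 and the two-line fragment are made up for illustration:

```python
import base64


def decode_quoted_base64(text: str) -> bytes:
    """Drop double quotes (0x22) and whitespace, then decode the remaining BASE64."""
    cleaned = "".join(ch for ch in text if ch != '"' and not ch.isspace())
    return base64.b64decode(cleaned)


# Made-up fragment: two quoted BASE64 strings, one per line.
fragment = '"TVqQAAMA"\n"AAAEAAAA"\n'
decoded = decode_quoted_base64(fragment)
print(decoded[:2])
```

Removing the quotes and newlines first is what turns the separate strings into one continuous run of BASE64 characters.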
A PE file was found, and it starts at position 0x04C7. I can cut this data out with option -c:
Another method to find the start of the PE file is to use a cut expression that searches for 'MZ', like this:
If there is more than one instance of string MZ, different cut-expressions must be tried to find the real start of the PE file. For example, this is the cut-expression to select data starting with the second instance of string MZ: -c "['MZ']2:"
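The idea behind such a cut-expression can be sketched in a few lines of Python. This is an illustration of the technique, not the code used by my tools, and the function name cut_from_nth is hypothetical:

```python
def cut_from_nth(data: bytes, marker: bytes, n: int = 1) -> bytes:
    """Return data starting at the n-th occurrence of marker (empty when not found)."""
    position = -1
    for _ in range(n):
        position = data.find(marker, position + 1)
        if position == -1:
            return b""
    return data[position:]


# Made-up data with two instances of MZ.
data = b"xxMZaaMZbb"
print(cut_from_nth(data, b"MZ", 1))
print(cut_from_nth(data, b"MZ", 2))
```

Each candidate can then be validated, because only the real start will parse as a PE file.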
It’s best to pipe the cut-out data into pecheck, to validate that it is indeed a PE file:
pecheck also helps with finding the length of the PE file (with the given cut-expression, I select all data until the end of the serialized data).
Remark that there is an overlay (bytes appended to the end of the PE file), and that it starts at position 0x1400. Since I don’t expect an overlay in this .NET assembly, the overlay is not part of the PE file, but it is part of the serialization meta data.
Hence I can cut out the PE file precisely like this:
This PE file can be saved to disk now for reverse-engineering.
I have not read the .NET serialization format specification, but I can make an educated guess. Right before the PE file, there is the following data:
Remark the first 4 bytes (5 bytes before the beginning of the PE file): 00 14 00 00. That’s 0x1400 as a little-endian 32-bit integer, exactly the length of the PE file, 5120 bytes:
So that’s most likely another method to determine the length of the PE file.
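Here is a minimal Python sketch of that guess. It assumes, as observed in this one sample only, that the 32-bit little-endian length sits 5 bytes before the start of the PE file. The function name length_before_pe and the test data are made up:

```python
import struct


def length_before_pe(serialized: bytes, pe_start: int) -> int:
    """Read the 32-bit little-endian integer found 5 bytes before pe_start."""
    (length,) = struct.unpack_from("<I", serialized, pe_start - 5)
    return length


# The 4 bytes seen in the sample: 00 14 00 00.
(pe_length,) = struct.unpack("<I", bytes.fromhex("00140000"))
print(pe_length, hex(pe_length))

# Made-up serialized data: filler, length field, one extra byte, then the PE file.
blob = b"\xAA" * 3 + bytes.fromhex("00140000") + b"\x02" + b"MZ" + b"\x00" * 16
print(length_before_pe(blob, blob.find(b"MZ")))
```

The result can be compared with the position of the overlay reported by pecheck, as a cross-check.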
In my malware analysis blog posts and videos, I always try to include the hash or VirusTotal link of the sample(s) I analyze. If I don’t, it means I’m not at liberty to share the hash.
For every video that I post on YouTube, I create a corresponding video blog post (https://videos.DidierStevens.com) with more info like the sample’s hash and a link to VirusTotal.
In the description of the YouTube video, you will find a link to the video blog post.
Example:
I will often use the MD5 hash, but since I include a link to VirusTotal, you can consult the report and find other hashes like sha256 in that report.
Regarding MD5: I don’t worry about hash collisions for malware samples. Actually, if there is an MD5 hash collision, VirusTotal will inform me, and that would make my day.
Don’t ask me for the malware samples I analyze: I don’t host or send these malware samples. If you or your organization have a VirusTotal Intelligence subscription, you can download the sample from VirusTotal.
If you don’t, there are several free repositories online (sometimes they require free registration). Lenny Zeltser has a list of repositories.
I got hold of a phishing PDF where the /URI is hiding inside a stream object (/ObjStm).
First I start the analysis with pdfid.py:
There is no /URI reported, but remark that the PDF contains 5 stream objects (/ObjStm). These can contain /URIs. In the past, I would search and decompress these stream objects with pdf-parser.py, and then pipe the result through pdfid.py, in order to detect /URIs (or other objects that require further analysis).
Since pdf-parser.py version 0.7.0, I prefer another method: using option -O to let pdf-parser.py extract and parse the objects inside stream objects.
With option -a (here combined with option -O), I can get statistics and keywords just like with pdfid:
Now I can see that there is a /URI inside the PDF (object 43).
Thus I can use option -k to get the value of /URI entries, combined with option -O to look inside stream objects:
And here I have the /URI.
Another method is to select object 43:
From this output, we also see that object 43 is inside stream object 16.
Remark: if you use option -O on a PDF that does not contain stream objects (/ObjStm), pdf-parser will behave as if you didn’t provide this option. Hence, if you want, you can always use option -O to analyze PDFs.
MD5 007de2c71861a3e1e6d70f7fe8f4ce9b is a malicious document: a spreadsheet with Excel 4.0 macros.
Excel 4.0 macros predate VBA macros: they are composed of functions placed inside cells of a macro sheet.
These macros are not stored in dedicated VBA streams, but as BIFF records in the Workbook stream.
Spreadsheets with Excel 4.0 macros can be analyzed with oledump.py and plugin plugin_biff.py.
Option -x of plugin_biff will select all BIFF records relevant for the analysis of Excel 4.0 macros:
In this output, we have all the BIFF records necessary to 1) determine that this is a malicious document and 2) report what this maldoc does.
The first BIFF record, BOUNDSHEET, tells us that the spreadsheet contains an Excel 4.0 macro sheet that is hidden.
The third BIFF LABEL record tells us that there is a cell with name Auto_Open: the macros will execute when the spreadsheet is opened.
And then we have BIFF FORMULA records that tell us that something is CONCATENATEd and EXECuted.
The BIFF STRING record provides us with the exact command (msiexec …) that will be executed.
The latest version of plugin_biff contains much larger lists of tokens and functions used in formula expressions. Of course, it’s still possible that a formula uses tokens and/or functions that are unknown to my plugin. This is now clearly indicated in the output:
*UNKNOWN FUNCTION* is reported when a function number is unknown. The function number is always reported. Here, for the sake of this example, a crippled version of plugin_biff reports functions with number 0x0037 and 0x0150. In the released version of plugin_biff, functions 0x0037 and 0x0150 are identified as RETURN and CONCATENATE respectively.
*INCOMPLETE FORMULA PARSING* is reported when a formula expression cannot be fully parsed. Left of the warning *INCOMPLETE FORMULA PARSING*, the partially parsed expression can be found, and right of the warning, the remaining, unparsed expression is reported as a Python string. If the remainder contains bytes that could be potentially dangerous functions like EXEC, then this is reported too.
The complete analysis of the maldoc is explained in this video:
A couple of years ago, I wrote a Python script to enhance Radare2 listings: the script extracts strings from stack frame instructions.
Recently, I combined my tools to achieve the same without a 32-bit disassembler: I extract the strings directly from the binary shellcode.
What I’m looking for is sequences of instructions like this: mov dword [ebp - 0x10], 0x61626364. In 32-bit code, that’s C7 45 followed by one byte (offset operand) and 4 bytes (value operand).
For the example above, that is C7 45 F0 64 63 62 61 (the offset -0x10 is encoded as the signed byte F0, and the value is stored little-endian). I can write a regular expression for this instruction, and use my tool re-search.py to extract it from the binary shellcode. I want at least 2 consecutive mov … instructions: {2,}.
I’m using option -f because I want to process a binary file (re-search.py expects text files by default).
And I’m using option -x to produce hexadecimal output (to simplify further processing).
I want to get rid of the bytes for the instruction and the offset operand. I do this with sed:
I could convert this back to text with my tool hex-to-bin.py:
But that’s not ideal, because now all characters are merged into a single line.
My tool python-per-line.py gives a better result by processing this hexadecimal input line per line:
Remark that I also use function repr to escape unprintable characters like 00.
This output provides a good overview of all API functions called by this shellcode.
If you take a close look, you’ll notice that the last strings are incomplete: that’s because they are missing one or two characters, and these are put on the stack with another mov instruction for single or double bytes. I can accommodate my regular expression to take these instructions into account:
This is the complete command:
re-search.py -x -f "(?:\xC7\x45.....){2,}(?:(?:\xC6\x45..)|(?:\x66\xC7\x45...))?" shellcode.bin.vir | sed "s/66c745..//g" | sed "s/c[67]45..//g" | python-per-line.py -e "import binascii" "repr(binascii.a2b_hex(line))"
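The same extraction can be sketched in pure Python. This is not how my tools work internally: it reuses the regular expression from the command above (with re.DOTALL added, as an assumption, so that "." also matches byte 0x0A), and then walks each match instruction by instruction instead of using sed. The function names and the test shellcode are made up:

```python
import re

# Same expression as in the command above, applied to bytes.
PATTERN = re.compile(
    rb"(?:\xC7\x45.....){2,}(?:(?:\xC6\x45..)|(?:\x66\xC7\x45...))?",
    re.DOTALL,
)


def decode_run(run: bytes) -> bytes:
    """Keep only the value operands of a run of mov [ebp+offset], value instructions."""
    out = bytearray()
    i = 0
    while i < len(run):
        if run[i:i + 3] == b"\x66\xC7\x45":    # mov word: 2 value bytes
            out += run[i + 4:i + 6]
            i += 6
        elif run[i:i + 2] == b"\xC7\x45":      # mov dword: 4 value bytes
            out += run[i + 3:i + 7]
            i += 7
        elif run[i:i + 2] == b"\xC6\x45":      # mov byte: 1 value byte
            out += run[i + 3:i + 4]
            i += 4
        else:
            break
    return bytes(out)


def stack_strings(shellcode: bytes) -> list:
    """Return the strings built on the stack, one per run of mov instructions."""
    return [decode_run(match.group(0)) for match in PATTERN.finditer(shellcode)]


# Made-up shellcode fragment that builds the string LoadLibraryA on the stack.
test_shellcode = (b"\x90" + b"\xC7\x45\xF0Load" + b"\xC7\x45\xF4Libr"
                  + b"\xC7\x45\xF8aryA" + b"\xC6\x45\xFC\x00" + b"\x90")
for string in stack_strings(test_shellcode):
    print(repr(string))
```

Walking the instructions avoids one weakness of the sed approach: a value operand that happens to contain the bytes C7 45 is not mistaken for an instruction.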
virustotal-search.py is a tool to query VirusTotal via its public API for file reports by providing hashes to search for.
This new version adds searching for URLs. Use option -t to select the type of search you want: file (default) or url.
Like this:
Option -e can be used to include extra information (present in the JSON reply) not included by default.
For example, a default file search does not include sha256 hashes:
But you can include it with option “-e sha256” like this:
The public API can also be used for queries for domain names and IP addresses. These queries are much simpler than file and url queries, and therefore I developed a very generic program to query APIs. This will be released soon.
virustotal-search_V0_1_5.zip (https)
MD5: 2155347687726A321D1ADBB9C9B81CFD
SHA256: 4F614C9D01C694AEAA16F7D5E4DBFBCF37E8E8D01D382C1137F401612D02E110
amsiscan.py is a Python script that uses Windows 10’s AmsiScanBuffer function to scan input for malware.
It reads one or more files or stdin.
The AmsiScanBuffer function returns 5 possible values when it is called for a scan:
AMSI_RESULT_CLEAN
AMSI_RESULT_NOT_DETECTED
AMSI_RESULT_BLOCKED_BY_ADMIN_START
AMSI_RESULT_BLOCKED_BY_ADMIN_END
AMSI_RESULT_DETECTED
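To make these values concrete, here is a small Python sketch that classifies a result. The numeric values are the ones I know from the AMSI_RESULT enumeration in the Windows SDK (amsi.h); the function name describe_amsi_result is hypothetical and is not part of amsiscan.py:

```python
# Values of the AMSI_RESULT enumeration, as defined in amsi.h.
AMSI_RESULT_CLEAN = 0
AMSI_RESULT_NOT_DETECTED = 1
AMSI_RESULT_BLOCKED_BY_ADMIN_START = 0x4000
AMSI_RESULT_BLOCKED_BY_ADMIN_END = 0x4FFF
AMSI_RESULT_DETECTED = 0x8000


def describe_amsi_result(result: int) -> str:
    """Map a numeric scan result to a short description."""
    if result >= AMSI_RESULT_DETECTED:
        return "detected"
    if AMSI_RESULT_BLOCKED_BY_ADMIN_START <= result <= AMSI_RESULT_BLOCKED_BY_ADMIN_END:
        return "blocked by admin"
    if result == AMSI_RESULT_CLEAN:
        return "clean"
    if result == AMSI_RESULT_NOT_DETECTED:
        return "not detected"
    return "other value"


for value in (0, 1, 0x4000, 0x8000):
    print(value, describe_amsi_result(value))
```

Any result equal to or larger than AMSI_RESULT_DETECTED is treated as malware here, which is the same test that the AmsiResultIsMalware macro performs.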
Example:
amsiscan_V0_0_1.zip (https)
MD5: 47E50599E0CFAF1D27416E68394289A0
SHA256: 044E41D7F31D8333CB5295FD6E430933CA67F9AC37CD400D38189C96AE48544D
This is an update of my PDF tools.
There are a couple of bug fixes for pdf-parser and pdfid.
And 2 new features in pdf-parser, inspired by a private training on maldoc analysis I gave last week. I often get good ideas from my students, and sometimes, even I get a good idea in class.
Option -o can now be used to select multiple objects: separate the indices by a comma.
There’s a new environment variable, PDFPARSER_OPTIONS, that can be used to provide extra options you want to include with each execution of pdf-parser.py. This is useful for option -O, an option to parse stream objects.
It’s actually best to always parse stream objects, i.e. always use option -O. But I decided not to make this an option that is on by default, so that the behavior of pdf-parser would remain unchanged. I consider this important for the many people that rely on a predictable behavior of pdf-parser, like teachers and students of infosec trainings where my tools are used/mentioned.
However, always including option -O is tedious and error-prone. So now you can have the best of both worlds, by defining an environment variable with name PDFPARSER_OPTIONS and value -O.
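The general technique of taking extra options from an environment variable can be sketched like this in Python. This is an illustration under my own assumptions (options from the variable are placed before the command-line arguments), not a copy of the code in pdf-parser.py, and the function name effective_arguments is hypothetical:

```python
import os
import shlex


def effective_arguments(argv, environ=os.environ, name="PDFPARSER_OPTIONS"):
    """Return the arguments to parse: options from the variable, then argv."""
    extra = shlex.split(environ.get(name, ""))
    return extra + list(argv)


# With the variable set to -O, every invocation behaves as if -O was typed.
print(effective_arguments(["-a", "example.pdf"], {"PDFPARSER_OPTIONS": "-O"}))
# Without the variable, the arguments are left unchanged.
print(effective_arguments(["-a", "example.pdf"], {}))
```

The default behavior only changes for users who explicitly define the variable, which is the point of this design.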
And finally, I started to add a man page (option -m), like I do with many of my other tools. This is a work in progress: for the moment, it points to my free PDF analysis e-book that explains the use of pdfid and pdf-parser.
pdf-parser_V0_7_3.zip (https)
MD5: 7EB1713631D255B36BC698CD2422C7EB
SHA256: D4D5AC9C26A9D8FEF65CE58A769D3F64A737860DC26606068CCDD3F04FDEA0D7
pdfid_v0_2_6.zip (https)
MD5: 9CCE332914A6C76410F04B7C35DA3155
SHA256: 95F7C91EEFB561F3F3BE9809ED339D85E7109BAA7E128EF056651EE018DBDBA0
ExifTool can misidentify VBA macro files as FlashPix files.
The binary file format of Office documents (.doc, .xls) uses the Compound File Binary Format, which I like to refer to as OLE files. These files can be analyzed with my tool oledump.py.
Starting with Office 2007, the default file format (.docx, .docm, .xlsx, …) is Office Open XML: OOXML. It’s in essence a ZIP container with XML files inside. However, VBA macros inside OOXML files (.docm, .xlsm) are not stored as XML files, they are still stored inside an OLE file: the ZIP container contains a file with name vbaProject.bin. That is an OLE file containing the VBA macros.
This can be observed with my zipdump.py tool:
oledump.py can look inside the ZIP container to analyze the embedded vbaProject.bin file:
And of course, it can handle an OLE file directly:
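The container layout described above can also be checked with Python’s zipfile module. This is a minimal sketch and not a replacement for zipdump.py or oledump.py: the function name find_vba_projects is hypothetical, and the test document is a made-up ZIP file built in memory:

```python
import io
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0")  # start of the OLE file header


def find_vba_projects(ooxml_bytes: bytes) -> list:
    """Return (name, starts_with_ole_magic) for each vbaProject.bin in the ZIP."""
    found = []
    with zipfile.ZipFile(io.BytesIO(ooxml_bytes)) as container:
        for name in container.namelist():
            if name.lower().endswith("vbaproject.bin"):
                content = container.read(name)
                found.append((name, content.startswith(OLE_MAGIC)))
    return found


# Made-up container that mimics the layout of a .docm file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<xml/>")
    archive.writestr("word/vbaProject.bin", OLE_MAGIC + b"\x00" * 16)
print(find_vba_projects(buffer.getvalue()))
```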
When ExifTool is given a vbaProject.bin file for analysis, it will misidentify it as a picture file: a FlashPix file.
That’s because when ExifTool doesn’t have enough metadata or an identifying extension to identify an OLE file, it falls back to FlashPix file detection: FlashPix files are also based on the OLE file format, and AFAIK ExifTool started out as an image tool:
That is why on VirusTotal, vbaProject.bin files from OOXML files with macros are misidentified as FlashPix files:
When the extension of a vbaProject.bin file is changed to .doc, ExifTool will misidentify it as a Word document:
ExifTool is not designed to identify VBA macro files (vbaProject.bin). These files are neither Office documents nor pictures. But since they are also OLE files, ExifTool tries to guess what they are, based on the extension, and if that doesn’t help, it falls back to the FlashPix file format (based on OLE).
There’s no “bug” to fix; you just need to be aware of this particular behavior of ExifTool. It is a tool to extract information from media formats: when it analyzes an OLE file and doesn’t have enough metadata or a proper file extension, it falls back to FlashPix identification.
I was reading about malware using WAV files and steganography to download payloads without triggering detection systems.
For example, here is a WAV file with a hidden, embedded PE file. The PE file is encoded in the least significant bit of 16-bit integers that encode PCM sound.
I was wondering how I could extract this embedded file with my tools. There was no easy solution, because many of my tools operate on byte streams, but here I have to operate on a bit stream. So I made an update to my format-bytes.py tool.
Using my tool file-magic.py, I get confirmation that this is a sound file (.WAV) with 16-bit PCM data.
And here is an ASCII/HEX dump of the beginning of the file made with cut-bytes.py:
The data chunk starts with magic sequence ‘data’ (in yellow), followed by the size of the data chunk (in green), and then the data itself: 16-bit, little-endian signed integers (in red).
To extract the least significant bit of each 16-bit, little-endian signed integer and assemble them into bytes, I use the latest version of format-bytes.py.
This is the command that I use:
format-bytes.py -a -f "bitstream=f:<H,b:0,j:<" #c#['data']+8: DB043392816146BBE6E9F3FE669459FEA52A82A77A033C86FD5BC2F4569839C9.wav.vir
With option -f, I specify a bitstream format.
f:<H means that the format of the data is little-endian (<), unsigned 16-bit integers (H). I could also specify a signed 16-bit integer (h), but this doesn’t matter here, as I’m not going to use the sign of the integers.
b:0 means that I extract the least-significant bit (position 0) of each 16-bit integer.
j:< means that I assemble (join) these bits into bytes from least significant to most significant (<).
The data starts 8 bytes into the data chunk, i.e. 8 bytes after the start of magic sequence 'data' (4 bytes for the magic sequence and 4 bytes for the chunk size). I define this with cut-expression #c#['data']+8:.
When I run this command, and perform an ASCII dump, I get this output for the beginning of the stream:
I can indeed see an executable (MZ), but it is preceded by 4 bytes. These 4 bytes are the length of the embedded file. As described in the article, the length is big-endian encoded. Hence I use a similar command to extract the length, but with j:>, as can be seen here:
The length is 733696 bytes, and this matches the IOCs from the article.
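The bit extraction itself can be sketched in Python as follows. This is my own minimal illustration of the technique, not the code of format-bytes.py. The parameter msb_first is my reading of the difference between j:< and j:>, and the helper hide_lsb only exists to produce made-up test data:

```python
import struct


def lsb_bytes(pcm: bytes, msb_first: bool = False) -> bytes:
    """Take bit 0 of each little-endian 16-bit integer and join 8 bits per byte."""
    count = len(pcm) // 2
    samples = struct.unpack("<%dH" % count, pcm[:count * 2])
    out = bytearray()
    for start in range(0, count - count % 8, 8):
        value = 0
        for position, sample in enumerate(samples[start:start + 8]):
            bit = sample & 1
            # Least significant bit first by default, most significant first on request.
            value |= bit << ((7 - position) if msb_first else position)
        out.append(value)
    return bytes(out)


def hide_lsb(payload: bytes, carrier_value: int = 0x1234) -> bytes:
    """Test helper: hide payload bits (least significant first) in 16-bit samples."""
    out = bytearray()
    for byte in payload:
        for position in range(8):
            bit = (byte >> position) & 1
            out += struct.pack("<H", (carrier_value & 0xFFFE) | bit)
    return bytes(out)


pcm_data = hide_lsb(b"MZ\x90\x00")
print(lsb_bytes(pcm_data))
```

Each output byte consumes 8 samples (16 bytes of PCM data), so the carrier must be at least 16 times larger than the embedded file.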
Then I use my tool pecheck.py to search for PE files inside the byte stream (-l P), like this:
MD5 7cb0e1e2cf4a9bf450a350a759490057 is indeed the hash of the malicious DLL encoded in this WAV file.
AutoCAD’s drawing files (.dwg) can contain VBA macros. The .dwg format is a proprietary file format. There is some documentation, for example here.
When VBA macros are stored inside a .dwg file, an OLE file is embedded inside the .dwg file. There’s a quick-and-dirty way to find this embedded file inside the .dwg file: search for magic sequence D0CF11E0.
My tool cut-bytes.py can be used to search for the first occurrence of byte sequence D0CF11E0 and extract all bytes starting from this sequence until the end of the .dwg file. This can be done by using cut-expression [D0CF11E0]: and piping the result into oledump.py, like this:
Next, oledump can be used to conduct the analysis as usual, for example by extracting the VBA macro source code:
There is also a more structured approach to locate the embedded OLE file inside a .dwg file. When one looks at a .dwg file with a hexadecimal editor, the following can be seen:
First there is a magic sequence identifying this as a .dwg file: AC1032. This sequence varies with the file format version, but for many years now, it has started with AC10. You can find more details regarding this magic sequence here and here.
At position 0x24 (36 decimal), there is a 32-bit little-endian integer. This is a pointer to the embedded OLE file (this pointer is NULL when no OLE file with VBA macros is embedded).
In our example, this pointer value is 0x00008080. And here is what can be found at this position inside the .dwg file:
First there is a 16-byte long header. At position 8 inside this header, there is a 32-bit little-endian integer that represents the length of the embedded file. 0x00001C00 in our example. And after the header one can find the embedded OLE file (notice magic sequence D0CF11E0).
This information can then be used to extract the OLE file from the .dwg file, like this:
This achieves exactly the same result as the quick-and-dirty method. The reason we don’t have to figure out the length of the embedded OLE file when using the quick-and-dirty method is that oledump ignores all bytes appended to an OLE file.
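The structured approach can be sketched in Python with the struct module. It only relies on the layout described above (pointer at position 0x24, 16-byte header, length at position 8 of that header), which I observed in samples and did not verify against a specification. The function name extract_ole_from_dwg and the test data are made up:

```python
import struct


def extract_ole_from_dwg(dwg: bytes):
    """Return the embedded OLE file, or None when the pointer at 0x24 is NULL."""
    (pointer,) = struct.unpack_from("<I", dwg, 0x24)
    if pointer == 0:
        return None
    # The length is a 32-bit little-endian integer at position 8 of the header.
    (length,) = struct.unpack_from("<I", dwg, pointer + 8)
    start = pointer + 16  # the OLE file follows the 16-byte header
    return dwg[start:start + length]


# Made-up .dwg-like data: magic, pointer 0x80, header, 8-byte payload, trailing bytes.
payload = bytes.fromhex("D0CF11E0") + b"\x01\x02\x03\x04"
test_dwg = bytearray(0x100)
test_dwg[0:6] = b"AC1032"
struct.pack_into("<I", test_dwg, 0x24, 0x80)
struct.pack_into("<I", test_dwg, 0x80 + 8, len(payload))
test_dwg[0x90:0x90 + len(payload)] = payload
print(extract_ole_from_dwg(bytes(test_dwg)).hex())
```

The quick-and-dirty equivalent is a search for D0CF11E0 followed by a slice to the end of the file, which is less precise but also works for other formats that embed OLE files.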
I will adapt my oledump.py tool to extract macros directly from .dwg files, without the need of a tool like cut-bytes.py, but I will probably implement something like the quick-and-dirty method, as this method would potentially work for other file formats with embedded OLE files, not only .dwg files.