Python 3 compatibility by frenzymadness · Pull Request #19 · unixfreak0037/officeparser

frenzymadness · 2019-10-24T10:52:16Z

Hello.

I am trying to make this tool Python 3 compatible while keeping backward compatibility with Python 2.7. I've tested my work with three scenarios and one test Word document. I am not a user of this tool, so I just compared the output for Python 2 and 3, and it seems to be okay.

Tested commands:

officeparser.py --create-manifest --extract-streams test/test.doc
officeparser.py --dump-stream-by-name=WordDocument test/test.doc
officeparser.py --print-streams test/test.doc
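
The Python 2 versus Python 3 comparison of these commands could be scripted. The following is only a minimal sketch and not part of this PR: `run_tool` and `find_differences` are hypothetical helper names, and it assumes that `python2` and `python3` are both on PATH and that it is run from the repository root. It compares only the exit code and console output, so files written by --create-manifest or --extract-streams would still need a separate check.

```python
import subprocess

# The three command lines listed above, as argument lists.
COMMANDS = [
    ["--create-manifest", "--extract-streams", "test/test.doc"],
    ["--dump-stream-by-name=WordDocument", "test/test.doc"],
    ["--print-streams", "test/test.doc"],
]


def run_tool(interpreter, args, script="officeparser.py"):
    """Run the script under one interpreter; return (exit code, output bytes)."""
    proc = subprocess.Popen([interpreter, script] + list(args),
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out


def find_differences(commands, interpreters=("python2", "python3"), run=run_tool):
    """Return the argument lists whose result differs between interpreters."""
    differing = []
    for args in commands:
        results = [run(interp, args) for interp in interpreters]
        if any(result != results[0] for result in results[1:]):
            differing.append(args)
    return differing

# Usage (needs both interpreters installed):
#     for args in find_differences(COMMANDS):
#         print("output differs for: " + " ".join(args))
```

The `run` parameter exists so the comparison logic can be exercised without either interpreter being installed.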

If you find something missing, please provide a reproducer (shell command) so I can use it to test my work and backward compatibility.

Fixes: #18

xambroz · 2019-10-25T18:24:55Z

Thank you very much for the help, it is cool.
I am adding one more patch for xrange, one more for the "to_hex" output where the usage of binary strings would clash, and one small cosmetic patch to get rid of the annoying error for not passing a filename when officeparser is executed without parameters.
Still testing.
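
For readers following along, the two compatibility points mentioned here (xrange and hex output of binary strings) can be illustrated roughly as follows. This is a sketch under assumptions, not the actual patch: the function is named `to_hex` only after the word quoted above, and its output format (space-separated two-digit hex) is invented for illustration.

```python
# Python 2: bind the lazy xrange to the name range at module level.
# Python 3: xrange does not exist, and the built-in range is already lazy.
try:
    range = xrange
except NameError:
    pass


def to_hex(data):
    """Return the bytes of `data` as space-separated two-digit hex values.

    Hypothetical stand-in for the tool's own helper. bytearray() yields
    integers on both Python 2 (str) and Python 3 (bytes), which avoids
    calling ord() on values that are already integers in Python 3.
    """
    return " ".join("{:02x}".format(byte) for byte in bytearray(data))
```

Iterating over a Python 3 `bytes` object gives integers while iterating over a Python 2 `str` gives one-character strings, which is the kind of clash the bytearray() conversion sidesteps.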

frenzymadness · 2019-10-25T18:27:48Z

@xambroz I can give you commit rights to my repository so you can continue there and your commits appear here. What do you think?

xambroz · 2019-10-26T18:26:39Z

Thank you - that would work. I will add what I have.

Currently I have patches which make it work on a plain Office file.

The only thing which I know is not working yet is the extraction of macros, but I hope to fix that as well.

xambroz · 2019-10-27T12:00:13Z

In the meantime, this is what I have to add at this point:
frenzymadness#1

I know that --extract-macros is not working in Python 3.
Tested like this:

1) download malware sample xls with macros from hybrid-analysis.com
https://www.hybrid-analysis.com/sample/8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f?environmentId=100
2) gunzip the file

gunzip 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin.gz

3) try with python2

python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 836, in _main
    buffer = StringIO()
NameError: global name 'StringIO' is not defined

4) try with python3

$ python3 $(which officeparser.py) --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "/usr/bin/officeparser.py", line 1234, in <module>
    _main()
  File "/usr/bin/officeparser.py", line 835, in _main
    buffer = StringIO()
NameError: name 'StringIO' is not defined

Even adding "from io import StringIO" does not directly fix the situation:
5) try with python2 and io.StringIO

python2 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 837, in _main
    buffer.write(ofdoc.get_stream(project.index))
TypeError: unicode argument expected, got 'str'

6) try with python3 and io.StringIO

python3 officeparser.py --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin
Traceback (most recent call last):
  File "officeparser.py", line 1235, in <module>
    _main()
  File "officeparser.py", line 837, in _main
    buffer.write(ofdoc.get_stream(project.index))
TypeError: string argument expected, got 'bytes'

The original (cStringIO.StringIO) gives this:

$ python2 officeparser.py.orig --extract-macros 8db9495dcd5b9ed6a8f1844ffc496f3eb282eb323a00e6d4aa92c58710c5890f.bin 
INFO: Saving VBA code to ./Sem_1.cls
INFO: Saving VBA code to ./Page1_1.cls
INFO: Saving VBA code to ./Module1_1.bas
INFO: Saving VBA code to ./UserForm1_1.frm
INFO: Saving VBA code to ./Module2_1.bas
INFO: Saving VBA code to ./Module3_1.bas
INFO: Saving VBA code to ./UserForm6_1.frm
INFO: Saving VBA code to ./Page11_1.cls
INFO: Saving VBA code to ./Module6_1.bas
INFO: Saving VBA code to ./Module5_1.bas
INFO: Saving VBA code to ./Module4_1.bas
INFO: Saving VBA code to ./Class1_1.cls
INFO: Saving VBA code to ./Sheet1_1.cls
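
The two TypeError tracebacks are what io.StringIO does whenever it is given raw stream data: it accepts only text on both Python versions, while io.BytesIO accepts byte strings on both. A minimal sketch of the difference follows; `collect_stream` is a hypothetical helper for illustration, not code from officeparser.

```python
import io


def collect_stream(chunks):
    """Concatenate binary chunks in an in-memory buffer and return the bytes."""
    buffer = io.BytesIO()  # takes str on Python 2 and bytes on Python 3
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()


data = collect_stream([b"Module=", b"NewMacros"])

# io.StringIO is a text buffer, so writing raw bytes fails on both versions.
try:
    io.StringIO().write(b"raw bytes")
    text_buffer_accepts_bytes = True
except TypeError:
    text_buffer_accepts_bytes = False
```

Whether the macro extraction code should switch to BytesIO or decode the stream first depends on how the PROJECT stream data is consumed afterwards.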

frenzymadness · 2019-11-11T06:32:29Z

I am investigating the macros extraction.

The very first question I need an answer to is whether the PROJECT stream in a document should be handled as bytes or as Unicode. Right now the handling is mixed, and that is why it does not work in Python 3. Do I understand correctly that it contains some VB script code, so it should be handled as Unicode?

xambroz · 2019-11-14T14:25:53Z

Hello,
yes, the PROJECT stream in Office documents seems to hold metadata about the macros in plain text, in an INI-like format.

$ python2 officeparser.py.orig --dump-stream-by-name PROJECT word_form.doc 
ID="{F71D9A8C-3763-458D-A309-7E5E41C49A1A}"
Document=ThisDocument/&H00000000
Module=NewMacros
Name="Project"
HelpContextID="0"
VersionCompatible32="393222000"
CMG="C1C327AD2BAD2BAD2BAD2B"
DPB="828064A724A824A824"
GC="4341A5E667E767E798"

[Host Extender Info]
&H00000001={3832D640-CF90-11CF-8E43-00A0C911005A};VBE;&H00000000
&H00000002={000209F2-0000-0000-C000-000000000046};Word8.0;&H00000000

[Workspace]
ThisDocument=46, 46, 678, 454, 
NewMacros=69, 69, 678, 506, Z
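
Given the dump above, one way to handle the stream is to keep it as bytes when reading and decode it once before parsing. The sketch below assumes a fixed `cp1252` encoding purely for illustration; as far as I understand the format, the real code page is recorded elsewhere in the VBA project, so treat the default as a placeholder. `parse_project_stream` is a hypothetical name, not a function in officeparser.

```python
def parse_project_stream(raw, encoding="cp1252"):
    """Decode a PROJECT stream and split it into {section: [(key, value), ...]}.

    Keys such as Module can repeat, so each section keeps an ordered list
    rather than a dict. Lines before the first [Section] header go under "".
    """
    sections = {"": []}
    current = ""
    for line in raw.decode(encoding).splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, [])
        elif "=" in line:
            key, _, value = line.partition("=")
            sections[current].append((key, value.strip('"')))
    return sections


# A shortened copy of the dump above, as the bytes a stream reader would return.
sample = (b'ID="{F71D9A8C-3763-458D-A309-7E5E41C49A1A}"\r\n'
          b'Document=ThisDocument/&H00000000\r\n'
          b'Module=NewMacros\r\n'
          b'\r\n'
          b'[Workspace]\r\n'
          b'NewMacros=69, 69, 678, 506, Z\r\n')
project = parse_project_stream(sample)
modules = [value for key, value in project[""] if key == "Module"]
```

The standard configparser module is not used here because the stream starts with key/value lines that belong to no section.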

frenzymadness · 2020-01-28T10:23:43Z

Hello.

Unfortunately, I don't have the capacity to work on this anymore. Could we please merge this PR to make officeparser at least partially Python 3 compatible, so others can continue without repeating the same work?

frenzymadness added 7 commits October 24, 2019 10:03

Fix prints - from statement to function

9f080d4

Use BytesIO from compatible io module instead of StringIO

b64db92

Different parsing bytes to name for Python 2/3

032a0d0

Use floor division where we need integer as a result

6afb84b

chunk is '' in Python 2 and b'' in Python 3 so better is to check its length

ace8691

Use compatible way for writing binary data to stdout

0a4c018

Write bytes to file opened as binary

fa56b4b
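
For context, three of the patterns named in these commit titles (floor division, checking the chunk length, and writing binary data to stdout) look roughly like this when written to run on both Python 2.7 and Python 3. This is an illustrative sketch, not the code from the commits; all function names and the 512-byte default are assumptions.

```python
import io
import sys


def read_chunks(stream, size):
    """Yield fixed-size chunks until the stream is exhausted."""
    while True:
        chunk = stream.read(size)
        # '' on Python 2 and b'' on Python 3 both have length 0.
        if len(chunk) == 0:
            break
        yield chunk


def sector_count(total_bytes, sector_size=512):
    """Number of whole sectors; // keeps the result an int on Python 3."""
    return total_bytes // sector_size


def write_binary(data, out=None):
    """Write bytes to stdout (or another stream) without text encoding."""
    out = sys.stdout if out is None else out
    # Python 3 text streams expose the underlying binary stream as .buffer;
    # Python 2's sys.stdout accepts byte strings directly.
    getattr(out, "buffer", out).write(data)
```

Using `/` instead of `//` would return a float on Python 3, which then fails wherever the value is used as an index or a length.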

rpmbuild and others added 4 commits October 31, 2019 13:41

patch xrange to range

bf03730

for python2 alias range to xrange

Print --help when no options are entered.

d10f9c4

This fixes annoying bug/feature that the script crashes when no file attribute is provided

Separate functions for the conversion of the python2 binary string / python3 binarray to ascii/hexdump

55e1022

This fixes issue with --print-header and --print-directory

OLE_SIGNATURE should be handled as bytes (no-op in Python 2)

bf793e8
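
As an illustration of why the signature needs to be a bytes literal: data read from a file opened in binary mode is `bytes` on Python 3, and comparing it with a text string is always unequal. The sketch below uses the standard eight-byte compound file signature; `looks_like_ole` is a hypothetical helper name, and the constant in officeparser may be written differently.

```python
# The b prefix is a no-op on Python 2 and makes this a bytes object on Python 3.
OLE_SIGNATURE = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"


def looks_like_ole(header):
    """Return True if the data starts with the compound file signature."""
    return header[:len(OLE_SIGNATURE)] == OLE_SIGNATURE
```

A typical caller would pass the first bytes of a file opened with mode "rb".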

frenzymadness mentioned this pull request Jan 28, 2020

Fixes for Python3 compatibility #20

Open

frenzymadness wants to merge 11 commits into unixfreak0037:master from frenzymadness:py3
