Fix filename encoding for zip archive with Python

Sometimes, after unzipping a ZIP archive downloaded from the Internet, you end up with directories and files whose names are mojibake. Unless an entry sets the UTF-8 flag (bit 11 of its general purpose flags), the names in a ZIP file are just bytes with no encoding information, and it is up to the unzipping tool to decide which encoding they use.

According to the documentation of Python's zipfile module, typical ZIP tools interpret filenames as CP437 when they cannot determine the original encoding.
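As a rough sketch of that rule, as far as I understand zipfile's behaviour: an entry whose general purpose flags have bit 11 (0x800) set is decoded as UTF-8, and any other entry is decoded as CP437. The helper name `declares_utf8` and the in-memory demo archive are made up for illustration.

```python
import io
from zipfile import ZipFile

UTF8_FLAG = 0x800  # bit 11 of the general purpose flags: name is UTF-8


def declares_utf8(zinfo):
    """Return True if the entry marks its filename as UTF-8 (hypothetical helper)."""
    return bool(zinfo.flag_bits & UTF8_FLAG)


# Build a tiny archive in memory; zipfile itself sets bit 11 when it
# has to store a non-ASCII name, and leaves it clear for ASCII names.
buf = io.BytesIO()
with ZipFile(buf, mode='w') as demo:
    demo.writestr('config.tex', b'')
    demo.writestr('概念与性质.tex', b'')

# Read the archive back and record which entries declare UTF-8 names.
with ZipFile(buf) as demo:
    flags = {zi.filename: declares_utf8(zi) for zi in demo.infolist()}
```

Entries for which the helper returns False are the ones whose non-ASCII names may come out as mojibake.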

The following demonstrates the mojibake produced by a ZIP archive whose filenames are encoded in GBK. The archive was generated by Baidu Wangpan, which enforces GBK. They might do so for compatibility with legacy Chinese systems, but I think they really should use UTF-8 instead of GBK.

from zipfile import ZipFile

zname = '【批量下载】第一节 概念与性质等.zip'
zf = ZipFile(zname)
zf.namelist()

The archive does not mark its names as UTF-8, so zipfile decodes them as CP437. To get the original bytes back in Python 3, just re-encode the names in CP437. You can then obtain the correct filenames by decoding those bytes as GBK.

[s.encode('cp437').decode('gbk') for s in zf.namelist()]
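The same round trip can be wrapped in a small helper that leaves a name alone when it cannot be CP437 mojibake. This is only a sketch: `fix_name` is a hypothetical name, and GBK is an assumed default that has to be adjusted per archive.

```python
def fix_name(name, encoding='gbk'):
    """Undo zipfile's CP437 decoding and decode the raw bytes as `encoding`."""
    try:
        raw = name.encode('cp437')
    except UnicodeEncodeError:
        # The name contains characters outside CP437, so it was not
        # produced by a CP437 decode; keep it unchanged.
        return name
    return raw.decode(encoding)


# Simulate what zipfile returns for a GBK-encoded name without the UTF-8 flag.
garbled = '第一节 概念与性质.pdf'.encode('gbk').decode('cp437')
restored = fix_name(garbled)
```

If the assumed encoding is wrong, `raw.decode` may raise `UnicodeDecodeError` or silently produce a different kind of mojibake, so the result still needs a human look.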

The first piece of bad news is that you cannot determine the original encoding without some guesswork. I guess from the language of the content, and that works reasonably well.
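One way to support that guesswork is to try a few likely encodings and look at which ones decode cleanly. The sketch below is not a detector; the function name `candidate_names` and the list of encodings are assumptions chosen for illustration.

```python
def candidate_names(name, encodings=('utf-8', 'gbk', 'big5', 'shift_jis', 'euc_kr')):
    """Map each encoding that decodes the raw name bytes cleanly to its result."""
    raw = name.encode('cp437')  # recover the bytes stored in the archive
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


# Simulate a GBK name that zipfile decoded as CP437, then list the candidates.
garbled = '第一节'.encode('gbk').decode('cp437')
guesses = candidate_names(garbled)
```

Several encodings may decode the same bytes without error, so the final choice is still made by reading the candidates and picking the one that makes sense.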

The other piece of bad news is that you cannot update a ZIP archive in place; even the famous Info-ZIP makes a temporary copy when renaming files. So the fix is to write a new archive with the corrected names.

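from copy import copy

LANG_FLAG = 0x800               # bit 11 indicates UTF-8 for filenames
OS_FLAG = 3                     # 3 represents UNIX
FILEMODE = 0o100664             # regular file with mode -rw-rw-r--

# zf is the ZipFile opened above; write a corrected copy to fixed.zip.
with ZipFile('fixed.zip', mode='w') as ztf:
    ztf.comment = zf.comment
    for zinfo in zf.infolist():
        zinfo.CRC = None        # zipfile skips the CRC check on read when this is None
        ztinfo = copy(zinfo)
        # Undo the CP437 decoding, then decode the raw bytes as GBK.
        ztinfo.filename = zinfo.filename.encode('cp437').decode('gbk')
        ztinfo.flag_bits |= LANG_FLAG
        ztinfo.create_system = OS_FLAG
        # With create_system = 3, the high 16 bits hold the UNIX file mode.
        ztinfo.external_attr = FILEMODE << 16
        ztf.writestr(ztinfo, zf.read(zinfo))

zf.close()

Listing the new archive with unzip shows the correctly decoded names:

$ unzip -l fixed.zip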
Archive:  fixed.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   347005  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第一节 概念与性质.pdf
      387  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/userCommands.tex
   241502  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第二节 换元积分法.pdf
   203684  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第三节 分部积分法.pdf
     6041  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第二节 换元积分法.tex
     3123  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第三节 分部积分法.tex
     8972  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第一节 概念与性质.tex
      176  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/config.tex
---------                     -------
   810890                     8 files
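The steps above could be wrapped in a reusable function that writes to a temporary file and then replaces the original, much like the temporary copy Info-ZIP makes. This is a sketch under assumptions: `reencode_zip_names` is a hypothetical name, the source encoding is passed in by the caller, and unlike the code above it keeps each entry's original attributes instead of forcing UNIX mode bits.

```python
import os
import tempfile
from copy import copy
from zipfile import ZipFile

UTF8_FLAG = 0x800  # bit 11: filename is stored as UTF-8


def reencode_zip_names(path, encoding='gbk'):
    """Rewrite the archive at `path` so that its names are stored as UTF-8."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so os.replace
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix='.zip', dir=directory)
    os.close(fd)
    try:
        with ZipFile(path) as src, ZipFile(tmp_path, mode='w') as dst:
            dst.comment = src.comment
            for zinfo in src.infolist():
                new_info = copy(zinfo)
                if not zinfo.flag_bits & UTF8_FLAG:
                    # Name was decoded as CP437; recover bytes and decode again.
                    new_info.filename = zinfo.filename.encode('cp437').decode(encoding)
                # zipfile stores non-ASCII names as UTF-8 and sets bit 11 itself.
                dst.writestr(new_info, src.read(zinfo))
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)  # do not leave a half-written copy behind
        raise
```

A call such as `reencode_zip_names('archive.zip', 'gbk')` (with a placeholder file name) would overwrite the archive, so it is worth trying it on a copy first.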

Some of the constants in the code need explanation; please refer to the ZIP File Format Specification (PKWARE's APPNOTE.TXT).
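As one small illustration of those constants: when an entry's creating system is UNIX (3), tools such as Info-ZIP conventionally keep the UNIX file mode in the high 16 bits of the external attributes field. The sketch below decodes that value with the standard `stat` module; `unix_mode` is a hypothetical helper name.

```python
import stat


def unix_mode(external_attr):
    """Render the high 16 bits of external_attr as an ls-style mode string."""
    return stat.filemode(external_attr >> 16)


FILEMODE = 0o100664                      # regular file, rw-rw-r--
mode_string = unix_mode(FILEMODE << 16)  # '-rw-rw-r--'
```

The leading `0o100000` is the file type (a regular file), which is why the constant is more than the familiar three permission digits.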

Author: Lei Zhao

Updated: 2017-11-27 Mon 10:16
