Fix filename encoding for zip archive with Python
After unzipping a ZIP archive downloaded from the Internet, you sometimes get directories and files with mojibake names. Names in a ZIP file are stored as bytes, usually without any encoding information, so it is up to the unzipping tool to decide which encoding they use.
According to the documentation of Python's zipfile module, typical ZIP tools interpret filenames as CP437 when they cannot determine the original encoding.
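To see which rule was applied to each entry, you can look at the general-purpose flag bits that zipfile exposes on every ZipInfo. The following is a minimal sketch, assuming the behaviour of CPython's zipfile (names flagged with bit 11 are decoded as UTF-8, everything else as CP437). The helper name `name_encodings` is made up for this illustration.

```python
import io
from zipfile import ZipFile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def name_encodings(zf):
    """Map each entry name to the codec zipfile used to decode it."""
    return {
        info.filename: 'utf-8' if info.flag_bits & UTF8_FLAG else 'cp437'
        for info in zf.infolist()
    }


# Build a tiny in-memory archive to try the helper on.
buf = io.BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('plain.txt', b'ASCII name')
    zf.writestr('微积分.txt', b'non-ASCII name, stored by zipfile as UTF-8')

with ZipFile(buf) as zf:
    report = name_encodings(zf)
```

Entries reported as 'cp437' with garbled names are the ones worth re-decoding; entries reported as 'utf-8' should be left alone.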
The following demonstrates the mojibake produced by a ZIP archive whose filenames are encoded in GBK. The archive was generated by Baidu Wangpan, which enforces GBK. They may do this for compatibility with legacy Chinese systems, but I think they should really enforce UTF-8 instead.
from zipfile import ZipFile

zname = '【批量下载】第一节 概念与性质等.zip'
zf = ZipFile(zname)
zf.namelist()
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/╡┌╥╗╜┌ ╕┼─ε╙δ╨╘╓╩.pdf
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/userCommands.tex
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/╡┌╢■╜┌ ╗╗╘¬╗²╖╓╖¿.pdf
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/╡┌╚²╜┌ ╖╓▓┐╗²╖╓╖¿.pdf
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/╡┌╢■╜┌ ╗╗╘¬╗²╖╓╖¿.tex
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/╡┌╚²╜┌ ╖╓▓┐╗²╖╓╖¿.tex
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/╡┌╥╗╜┌ ╕┼─ε╙δ╨╘╓╩.tex
- ╬ó╗²╖╓B(1)/╡┌11╓▄╜▓┐╬╠ß╕┘/config.tex
The tool cannot tell which encoding the filenames use, so it decodes them as CP437. To get the original bytes back in Python 3, just re-encode the names in CP437. Decoding those bytes as GBK then gives the correct filenames.
[s.encode('cp437').decode('gbk') for s in zf.namelist()]
- 微积分B(1)/第11周讲课提纲/第一节 概念与性质.pdf
- 微积分B(1)/第11周讲课提纲/userCommands.tex
- 微积分B(1)/第11周讲课提纲/第二节 换元积分法.pdf
- 微积分B(1)/第11周讲课提纲/第三节 分部积分法.pdf
- 微积分B(1)/第11周讲课提纲/第二节 换元积分法.tex
- 微积分B(1)/第11周讲课提纲/第三节 分部积分法.tex
- 微积分B(1)/第11周讲课提纲/第一节 概念与性质.tex
- 微积分B(1)/第11周讲课提纲/config.tex
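The round trip works because Python's cp437 codec assigns a character to every byte value, so no information is lost when the GBK bytes are misread as CP437. Here is a small self-contained sketch of that round trip; the helper `fix_name` is a hypothetical name, and the garbled string is simulated rather than read from a real archive.

```python
def fix_name(name, encoding='gbk'):
    """Undo a CP437 mis-decoding and decode the raw bytes as `encoding`."""
    return name.encode('cp437').decode(encoding)


original = '微积分B(1)/第一节 概念与性质.pdf'
# Simulate what an unzip tool sees: GBK bytes read as if they were CP437.
garbled = original.encode('gbk').decode('cp437')
restored = fix_name(garbled)
```

ASCII characters such as `B(1)/` survive both steps unchanged, which is why parts of the garbled names above are still readable.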
The first piece of bad news is that you cannot determine the original encoding beforehand without some guesswork. I guess from the language of the content, and that works reasonably well.
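One way to mechanize part of the guesswork is to try a few candidate encodings and keep the first one that decodes every name without error. This is only a rough heuristic sketch: the function name and the candidate list are my own assumptions, and several East Asian encodings accept each other's bytes, so a clean decode does not prove the guess is right. Always eyeball the result.

```python
def guess_encoding(names, candidates=('utf-8', 'gbk', 'big5', 'shift_jis')):
    """Return the first candidate that decodes all names, or None."""
    try:
        # Recover the raw bytes behind the CP437-decoded names.
        raw_names = [name.encode('cp437') for name in names]
    except UnicodeEncodeError:
        return None  # these names were not produced by CP437 decoding
    for enc in candidates:
        try:
            for raw in raw_names:
                raw.decode(enc)
        except UnicodeDecodeError:
            continue  # at least one name is invalid in this encoding
        return enc
    return None


# Simulated input: a GBK name as a CP437-decoding tool would show it.
sample = ['微积分B(1)/第11周讲课提纲/config.tex'.encode('gbk').decode('cp437')]
guess = guess_encoding(sample)
```

Putting UTF-8 first is deliberate: it is strict enough that legacy multi-byte text rarely passes as valid UTF-8 by accident.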
The other piece of bad news is that you cannot update the ZIP archive in place; even the famous Info-ZIP makes a temporary copy while renaming files. So the fix is to write a new archive with the corrected names:
from copy import copy

LANG_FLAG = 0x800    # bit 11 indicates UTF-8 for filenames
OS_FLAG = 3          # 3 represents UNIX
FILEMODE = 0o100664  # file mode for a regular file with -rw-rw-r--

with ZipFile('fixed.zip', mode='w') as ztf:
    ztf.comment = zf.comment
    for zinfo in zf.infolist():
        # With CRC set to None, zf.read() skips its checksum verification.
        zinfo.CRC = None
        ztinfo = copy(zinfo)
        # Recover the raw name bytes, then decode them as GBK.
        ztinfo.filename = zinfo.filename.encode('cp437').decode('gbk')
        ztinfo.flag_bits |= LANG_FLAG
        ztinfo.create_system = OS_FLAG
        # The UNIX file mode lives in the upper 16 bits of external_attr.
        ztinfo.external_attr = FILEMODE << 16
        ztf.writestr(ztinfo, zf.read(zinfo))
zf.close()
unzip -l fixed.zip
Archive:  fixed.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   347005  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第一节 概念与性质.pdf
      387  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/userCommands.tex
   241502  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第二节 换元积分法.pdf
   203684  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第三节 分部积分法.pdf
     6041  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第二节 换元积分法.tex
     3123  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第三节 分部积分法.tex
     8972  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/第一节 概念与性质.tex
      176  2015-06-20 01:40   微积分B(1)/第11周讲课提纲/config.tex
---------                     -------
   810890                     8 files
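If you need this more than once, the same idea can be wrapped in a function. The version below is a sketch under a few assumptions of mine, not a drop-in tool: the name `reencode_names` is hypothetical, entries already flagged as UTF-8 are copied unchanged, the original permissions are kept instead of being overwritten, and encrypted entries are not handled. It also leaves CRC verification on, so a corrupt entry raises an error instead of being copied silently.

```python
from copy import copy
from zipfile import ZipFile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def reencode_names(src, dst, encoding='gbk'):
    """Copy archive `src` to `dst`, storing the entry names as UTF-8.

    `src` and `dst` may be paths or file-like objects.
    """
    with ZipFile(src) as zin, ZipFile(dst, mode='w') as zout:
        zout.comment = zin.comment
        for zinfo in zin.infolist():
            data = zin.read(zinfo)  # raises if the stored CRC does not match
            fixed = copy(zinfo)
            if not zinfo.flag_bits & UTF8_FLAG:
                # Unflagged names were decoded as CP437; recover the raw
                # bytes and decode them with the guessed encoding.
                fixed.filename = zinfo.filename.encode('cp437').decode(encoding)
                # Request the UTF-8 flag explicitly; as far as I can tell,
                # zipfile also sets it on its own for non-ASCII names.
                fixed.flag_bits |= UTF8_FLAG
            zout.writestr(fixed, data)
```

A call such as `reencode_names('broken.zip', 'fixed.zip')` (both file names are placeholders) would then replace the loop above.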
Some of the constants in the code above need explanation; for those, please refer to the ZIP File Format Specification.