Split large files using python
I have some trouble trying to split large files (say, around 10GB). The basic idea is simply to read the lines and group every, say, 40000 lines into one file.
But there are two ways of "reading" a file.
1) The first is to read the WHOLE file at once and turn it into a list. But that requires loading the whole file into memory, which is painful for a file this large. (I think I have asked about this before.)
In Python, the approaches I have tried for reading a whole file at once include:
input1 = f.readlines()
input1 = commands.getoutput('zcat ' + file).splitlines(True)
input1 = subprocess.Popen(["cat", file],
                          stdout=subprocess.PIPE, bufsize=1)
Well, then I could easily group 40000 lines into one file, e.g. list[40000:80000] or list[80000:120000].
Another advantage of a list is that we can easily point to specific lines.
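For example, something like the following (assuming the whole file really fits in memory; the filename here is just a placeholder):

NUM_OF_LINES = 40000

with open('myinput.txt') as f:
    lines = f.readlines()          # the whole file as a list of lines

# slice the list into 40000-line chunks and write each chunk out
for n, start in enumerate(range(0, len(lines), NUM_OF_LINES)):
    with open('output%d.txt' % n, 'w') as fout:
        fout.writelines(lines[start:start + NUM_OF_LINES])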
2) The second way is to read line by line, processing each line as it is read; the lines already read are not kept in memory.
Examples include:
f = gzip.open(file)
for line in f: blablabla...
or
for line in fileinput.FileInput(fileName):
I'm sure that for gzip.open, this f is NOT a list but a file object. And it seems we can only process it line by line; so how can I do this "split" job? How can I point to specific lines of the file object?
Thanks
NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    fout = open("output0.txt", "w")   # text mode, since we write text lines
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            # this chunk is full: close it and start the next output file
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")
    fout.close()
If there is nothing special about having a specific number of lines in each output file, the readlines() function also accepts a size 'hint' parameter that behaves like this:
If given an optional parameter sizehint, it reads that many bytes from
the file and enough more to complete a line, and returns the lines
from that. This is often used to allow efficient reading of a large
file by lines, but without having to load the entire file in memory.
Only complete lines will be returned.
...so you could write the code something like this:
# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)   # buf is a list of lines, so use writelines()
        outFile.close()
        fileNumber += 1
import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        # start a new output file every chunk_size lines
        if fout:
            fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
if fout:
    fout.close()
For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do (a sketch follows the list):
- Open the input file.
- Open the first output file.
- Read one line from the input file and write it to the output file.
- Keep a count of how many lines you have written to the current output file; as soon as it reaches 40000, close the output file and open the next one.
- Repeat steps 3-4 until you reach the end of the input file.
- Close both files.
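A minimal sketch of that outline might look like this (the filenames and the 40000-line chunk size are placeholders):

NUM_OF_LINES = 40000

fin = open('myinput.txt')            # 1. open the input file
file_number = 0
fout = open('output0.txt', 'w')      # 2. open the first output file
line_count = 0

for line in fin:                     # 3. read a line, write it out
    fout.write(line)
    line_count += 1
    if line_count == NUM_OF_LINES:   # 4. rotate once we hit 40000 lines
        fout.close()
        file_number += 1
        fout = open('output%d.txt' % file_number, 'w')
        line_count = 0

fin.close()                          # 6. close both files
fout.close()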
The best solution I have found is the filesplit library (https://pypi.org/project/filesplit/).
You only need to specify the input file, the output folder, and the desired size of the output files in bytes. The library then does all the work for you.
from fsplit.filesplit import Filesplit

fs = Filesplit()

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)
Obviously, as you are doing work on the file, you will need to iterate over its contents in some way; whether you do that manually or let part of the Python API do it for you (e.g. the readlines() method) is not important. In big-O terms, this means you will spend O(n) time, where n is the size of the file.
But reading the file into memory also requires O(n) space. Although we sometimes do need to read a 10 GB file into memory, your particular problem does not. We can iterate over the file object directly. The file object does require some space, of course, but there is no reason to hold the file's contents twice in two different forms.
Therefore, I would go with your second approach.
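And if you still need to point at a specific range of lines in a file object (your gzip.open case), itertools.islice can slice an iterator lazily. A minimal sketch, assuming a gzipped input with a placeholder filename:

import gzip
from itertools import islice

# islice consumes (and discards) the first 40000 lines, then yields
# lines 40000-79999; the whole file is never held in memory.
with gzip.open('myinput.txt.gz', 'rt') as f:
    with open('output1.txt', 'w') as fout:
        fout.writelines(islice(f, 40000, 80000))

Note that islice still has to read past the earlier lines, but it keeps only one line in memory at a time.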