Split large files using Python

I'm having some trouble trying to split large files (say, around 10GB). The basic idea is to simply read the lines and group every, say, 40000 lines into one file.

But there are two ways of "reading" files.

1) The first is to read the whole file at once and turn it into a list. But this requires loading the entire file into memory, which is painful for a file this large. (I think I've asked questions like this before.)

The approaches I've tried in Python for reading a whole file at once include:

input1 = f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat", file],
               stdout=subprocess.PIPE, bufsize=1)

Well, then I can easily group 40000 lines into one file: list[40000:80000] or list[80000:120000].

Another advantage of using a list is that we can easily point to specific lines.
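For illustration, a minimal sketch of this list-based grouping, assuming an input named myinput.txt (it still loads the whole file into memory, which is exactly the pain point above):

CHUNK = 40000

with open('myinput.txt') as f:
    lines = f.readlines()  # loads the entire file into memory

# slice the list into 40000-line groups and write each group to its own file
for n in range(0, len(lines), CHUNK):
    with open('output%d.txt' % (n // CHUNK), 'w') as fout:
        fout.writelines(lines[n:n + CHUNK])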

2) The second way is to read line by line, processing each line as it is read. Lines that have already been read are not kept in memory.

Examples include:

f = gzip.open(file)
for line in f: blablabla...

for line in fileinput.FileInput(fileName): blablabla...


I'm sure that for gzip.open, this f is not a list but a file object. And it seems we can only process it line by line; so how can I carry out this "split" job? How can I point to specific lines of a file object?

Thanks


One way is to count lines as you write, starting a new output file every NUM_OF_LINES lines:

NUM_OF_LINES = 40000
filename = 'myinput.txt'

with open(filename) as fin:
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            fout.close()
            # integer division keeps the file index whole on Python 3
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")
    fout.close()

If there's nothing special about having a specific number of lines in each file, the readlines() function also accepts a size "hint" parameter that behaves like this:

If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

...so you could write your code like this:

# assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.
SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        outFile = open("outFile%d.txt" % fileNumber, "wt")
        outFile.writelines(buf)  # buf is a list of lines, so writelines(), not write()
        outFile.close()
        fileNumber += 1

A variant using fileinput, which starts a new output file at every chunk_size-line boundary:

import fileinput

chunk_size = 40000
fout = None
for i, line in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        if fout:
            fout.close()
        # integer division keeps the file index whole on Python 3
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
if fout:
    fout.close()

For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do (a minimal sketch follows the list):

  1. Open the input file.
  2. Open the first output file.
  3. Read one line from the input file and write it to the output file.
  4. Keep a count of how many lines you have written to the current output file; as soon as it reaches 40000, close the output file and open the next one.
  5. Repeat steps 3-4 until you have reached the end of the input file.
  6. Close both files.
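A minimal sketch of that outline; 'input.txt' and the output file names are illustrative:

# Sketch of the outline above (file names are illustrative).
LINES_PER_FILE = 40000

with open('input.txt') as fin:                # step 1: open the input file
    file_number = 0
    fout = open('output0.txt', 'w')           # step 2: open the first output file
    count = 0
    for line in fin:
        fout.write(line)                      # step 3: read a line, write it out
        count += 1
        if count == LINES_PER_FILE:           # step 4: roll over every 40000 lines
            fout.close()
            file_number += 1
            fout = open('output%d.txt' % file_number, 'w')
            count = 0
    fout.close()                              # step 6: close the last output file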

  • The best solution I have found is the library filesplit (https://pypi.org/project/filesplit/).

    You just need to specify the input file, the output folder, and the desired size of the output files in bytes. The library then does all the work for you.

    from fsplit.filesplit import Filesplit

    fs = Filesplit()

    # callback invoked for each split file that gets written
    def split_cb(f, s):
        print("file: {0}, size: {1}".format(f, s))

    fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)

  • Obviously, as you do work on the file, you will need to iterate over its contents in some way -- whether you do that manually or let part of the Python API do it for you (e.g. the readlines() method) doesn't matter. In big-O terms, this means you will spend O(n) time, n being the size of the file.

    But reading the file into memory also requires O(n) space. Although we sometimes do need to read a 10 GB file into memory, your particular problem does not require it: we can simply iterate over the file object directly. Of course, the file object does require space, but there is no reason to hold the file's contents twice in two different forms.

    Therefore, I would go with your second solution.
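    As a side note on "pointing to specific lines" of a file object: itertools.islice can slice any iterable lazily, so a sketch like the following (the line numbers and file name are just examples) reads a given line range without loading the whole file, and works on gzip.open file objects too:

    from itertools import islice

    # Lazily pull lines 40000..79999 out of a file object without
    # loading the rest of the file into memory.
    with open('myinput.txt') as f:
        for line in islice(f, 40000, 80000):
            pass  # process the line here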

