1. Common operations: numpy and pandas

<p>[TOC]</p>
<h2>numpy</h2>
<h3>Setting numpy print precision and disabling scientific notation</h3>
<pre><code>import numpy as np

# Summary of the two options used below:
#   np.set_printoptions(precision=3)    # decimal precision of printed values
#   np.set_printoptions(suppress=True)  # do not use scientific notation

x = np.random.random(10)
print(x)
# [ 0.07837821 0.48002108 0.41274116 0.82993414 0.77610352 0.1023732
#   0.51303098 0.4617183 0.33487207 0.71162095]

np.set_printoptions(precision=3)
print(x)
# [ 0.078 0.48 0.413 0.83 0.776 0.102 0.513 0.462 0.335 0.712]

# ---------------------
y = np.array([1.5e-10, 1.5, 1500])
print(y)
# [ 1.500e-10 1.500e+00 1.500e+03]

np.set_printoptions(suppress=True)
print(y)
# [ 0. 1.5 1500. ]
</code></pre>
<h3>Filtering numpy arrays</h3>
<p>References: <a href="https://www.runoob.com/numpy/numpy-matplotlib.html">https://www.runoob.com/numpy/numpy-matplotlib.html</a> <a href="https://blog.csdn.net/liangzuojiayi/article/details/51547950">https://blog.csdn.net/liangzuojiayi/article/details/51547950</a> <a href="https://www.cnblogs.com/xn5991/p/9526267.html">https://www.cnblogs.com/xn5991/p/9526267.html</a></p>
<pre><code># Count the elements that satisfy a condition
(num_np < 5000).sum()

# Select the elements that satisfy a condition (here: values above 10000)
num_np[(num_np > 10000)]
array([69242, 11265, 10972, 51333, 10677, 10116, 10263, 45568, 42024, 10325, 44767, 11238, 21586, 19623, 10291, 12331, 10898, 11977, 23928, 44700, 26890, 58326, 58443, 35057, 19791, 10379, 15853, 14560, 59957, 20087, 61944, 55273, 32915, 19264, 10524, 53171, 12676, 10110, 13473, 10099, 10910, 10344, 29827, 15077, 10708, 21881, 51478, 57611, 12668, 15099, 41918, 47593, 14701, 40244, 63219, 61217, 65727, 37542, 21543, 75385, 10535, 26085, 54677, 10547, 49567, 12757, 71151, 72843, 57292, 10001, 35191, 73848, 60414, 32646, 63310, 11037, 41040, 62953, 11657, 49475, 15879, 12660, 14338, 14191, 10888, 11731, 57730, 59461, 63470, 10692, 15153, 16774, 42224, 53700, 13077, 13750, 14369, 11642, 10559, 14141, 10346, 51544, 16882, 53320, 11767, 13514, 31132, 54718, 11626, 11318, 73122, 60632, 10048, 12965, 27150, 17459, 15472, 12895, 22079, 16574, 21092, 10225, 14541, 67910, 11063, 42012, 10723, 10427, 23296, 80023, 12355, 10831,
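23355, 11830, 19555, 11083, 41854, 15018, 11767, 12901, 11387, 13705, 22698, 10055, 67356])</code></pre>
<h2>Plotting with matplotlib: article length distribution</h2>
<pre><code>import numpy as np
from matplotlib import pyplot as plt
import matplotlib
# from pylab import *
# figure(figsize=(8,6), dpi=80)

# np_w (per-article character counts) and zhfont1 (font properties for a
# Chinese font) are defined elsewhere and are not shown in these notes.
y = np_w
x = np.arange(0, len(np_w))
plt.title("文章字数分布", fontproperties=zhfont1)   # "Article length distribution"
plt.xlabel("文章序号", fontproperties=zhfont1)      # "Article index"
plt.ylabel("文章字数", fontproperties=zhfont1)      # "Characters per article"
plt.plot(x, y)
plt.show()</code></pre>
<h2>pandas</h2>
<h3>Saving as a JSON file: one record per line, or everything on one line</h3>
<pre><code>import json

# result_file (the output path) is defined elsewhere.
pred_answers = []  # a list; each item is written on its own line
# ensure_ascii=False writes non-ASCII text as-is, so open the file as UTF-8
with open(result_file, 'w', encoding='utf-8') as fout:
    for pred_answer in pred_answers:
        fout.write(json.dumps(pred_answer, ensure_ascii=False) + '\n')</code></pre>
<ul>
<li>
<p>Read with pandas, then save as JSON with one record per line</p>
<pre><code>import pandas as pd
import json

file_name = 'train_flag_croups_jieba'
df = pd.read_json('../train/' + file_name + '.json')
with open('./' + file_name + '_' + '.json', 'w', encoding='utf-8') as fout:
    for record in df.to_dict(orient='records'):
        fout.write(json.dumps(record, ensure_ascii=False) + '\n')</code></pre>
</li>
<li>
<p>Save everything on a single line</p>
<pre><code># orient='records' turns each row into one record;
# the default orient maps each column to its values instead.
with open('./' + file_name + '_' + '.json', 'w', encoding='utf-8') as fout:
    fout.write(json.dumps(df.to_dict(orient='records'), ensure_ascii=False) + '\n')</code></pre>
</li>
</ul>
<h3>Reading large files with pandas</h3>
<p>0. Read line by line</p>
<pre><code>import pandas as pd

data = []
# encoding='gbk',
with open('../test/test_croups_jieba.json', 'r', errors='ignore') as f:
    for i, line in enumerate(f):
        data.append(line)
        # Stop once the first 100 lines have been collected
        if len(data) >= 100:
            break
data = pd.DataFrame(data)</code></pre>
<p>1. Read selected columns</p>
<p>A CSV file often has many columns, while we usually care about only a few of them. Reading every row in full and extracting the needed fields afterwards adds unnecessary parsing work and memory use, so we can pass the usecols parameter to read_csv() to improve efficiency.</p>
<pre><code>file = pd.read_csv('demo.csv', usecols=['column1', 'column2', 'column3'])</code></pre>
<p>The usecols parameter lists the 3 columns to read, so the resulting DataFrame (file) contains only those 3 columns.</p>
<p>2. Read a limited number of rows</p>
<p>When writing code, we often want to test on part of the data first and process the full data set only after the test passes. Sometimes we also need just part of the data for a computation. In these cases, the nrows parameter of read_csv() sets the number of rows to read.</p>
<pre><code>file = pd.read_csv('demo.csv', nrows=1000, usecols=['column1', 'column2',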
'column3'])</code></pre>
<p>Only the first 1000 rows are read.</p>
<p>3. Read in chunks</p>
<p>read_csv() also has a chunksize parameter, which sets the number of rows per chunk. With chunksize set, the call does not return a DataFrame that you iterate over directly; it returns a TextFileReader object, which yields one DataFrame per chunk when iterated.</p>
<pre><code>reader = pd.read_csv('demo.csv', nrows=10000,
                     usecols=['column1', 'column2', 'column3'],
                     chunksize=1000, iterator=True)
reader
# Output:
# <pandas.io.parsers.TextFileReader at 0x120d2f290></code></pre>
<p>4. Other</p>
<p>head() and tail()</p>
<p>After getting a very large CSV file, you can use these methods to check its format by looking at the first few rows. head() returns the first 5 rows by default (pass a number, e.g. head(10), for more), and tail() returns the last 5 rows in the same way.</p>
<pre><code># read_csv() already returns a DataFrame
df = pd.read_csv('demo.csv')
df.head()
df.tail()</code></pre>
<p>These are the ones I have used so far; more will be added as they come up.</p>
<p>Once the data is loaded, analyze the logical relationships within it, build data structures that can represent those relationships, and then process the data accordingly.</p>
<p>Original articles: <a href="https://blog.csdn.net/ninnyyan/article/details/80999378">https://blog.csdn.net/ninnyyan/article/details/80999378</a> <a href="https://blog.csdn.net/wld914674505/article/details/81431128">https://blog.csdn.net/wld914674505/article/details/81431128</a></p>
<h3>Selecting a row</h3>
<pre><code># Select by row label
data_pd.loc[1]
# [out]
# article_content    [法国航宇防务网站2009年5月13日报道] 美国海军网络战司令部最近将海军电子战技术一体...
# article_id         40693
# article_title      美国海军创建电子战中心加强战斗效率
# article_type       防务快讯
# questions          [{'questions_id': '89c72cb5-0c37-4dd9-a619-5b3...
# Name: 1, dtype: object

# Select a range of rows
print(data_pd[1:2])</code></pre>
<h3>Selecting a column</h3>
<pre><code>data_pd['questions'][0]
data_pd['questions']</code></pre>
<h3>Printing basic information</h3>
<pre><code>data_pd.info()</code></pre>
<h3>Basic operations</h3>
<pre><code>import pandas as pd
import os

os.getcwd()
os.chdir('/Users/zcr/work/PycharmProjects/project/nlp/sentense_class')
os.getcwd()

neg = pd.read_csv("neg.csv", encoding="utf-8")
neg.head(10)
neg.columns = ['content']
neg.head(5)
neg['label'] = 0

# on_bad_lines='skip' replaces error_bad_lines=False,
# which was deprecated in pandas 1.3 and removed in pandas 2.0
neutral = pd.read_csv("neutral.csv", encoding="utf-8", on_bad_lines='skip')
neutral.head(5)
neutral.columns = ['content']
neutral['label'] = 2
neutral.head(5)

# Concatenate (avoid the name `all`, which shadows the built-in)
all_df = pd.concat([neg, neutral])
# Shuffle
df = all_df.sample(frac=1.0)
# Save without the row index
df.to_csv('all.csv', index=False)
</code></pre>
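<p>The chunked read_csv() call shown earlier only creates the reader; it does not show how the reader is consumed. The sketch below is one possible way to do that, under some assumptions: the in-memory CSV text stands in for demo.csv, and the extra column and the variable names (csv_text, chunk_sizes, total_rows, column1_sum) are made up for illustration. Iterating over the reader yields one DataFrame per chunk, so totals can be accumulated without holding the whole file in memory.</p>

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for 'demo.csv' (2500 data rows)
csv_text = "column1,column2,column3,extra\n" + "".join(
    f"{i},{i * 2},{i % 3},x\n" for i in range(2500)
)

chunk_sizes = []
total_rows = 0
column1_sum = 0

# With chunksize set, read_csv returns an iterator of DataFrames
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['column1', 'column2', 'column3'],
    chunksize=1000,
)

for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most 1000 rows
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)
    column1_sum += int(chunk['column1'].sum())

print(chunk_sizes)   # [1000, 1000, 500]
print(total_rows)    # 2500
print(column1_sum)   # 3123750
```

<p>For a real file, replace io.StringIO(csv_text) with the file path; the last chunk is simply shorter when the row count is not a multiple of chunksize.</p>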
