Pandas 数据框增、删、改、查、去重、抽样基本操作方法

2025-02-26 12:41:46

总括

pandas的索引函数主要有三种：

loc 标签索引，行和列的名称

iloc 整型索引（绝对位置索引），绝对意义上的几行几列，起始索引为0

ix 是 iloc 和 loc的合体

at是loc的快捷方式

iat是iloc的快捷方式

建立测试数据集：

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'],'c': ["A","B","C"]})
print(df)
 a b c
0 1 a A
1 2 b B
2 3 c C

行操作

选择某一行

print(df.loc[1,:])
a 2
b b
c B
Name: 1, dtype: object

选择多行

print(df.loc[1:2,:])#选择1:2行，slice为1
 a b c
1 2 b B
2 3 c C
print(df.loc[::-1,:])#选择所有行，slice为-1，所以为倒序
 a b c
2 3 c C
1 2 b B
0 1 a A
print(df.loc[0:2:2,:])#选择0至2行，slice为2，等同于print(df.loc[0:2:2,:])因为只有3行
 a b c
0 1 a A
2 3 c C

条件筛选

普通条件筛选

print(df.loc[:,"a"]>2)#原理是首先做了一个判断，然后再筛选
0 False
1 False
2  True
Name: a, dtype: bool
print(df.loc[df.loc[:,"a"]>2,:])
 a b c
2 3 c C

另外条件筛选还可以集逻辑运算符 | for or, & for and, and ~for not

In [129]: s = pd.Series(range(-3, 4))
In [132]: s[(s < -1) | (s > 0.5)]
Out[132]:
0 -3
1 -2
4 1
5 2
6 3
dtype: int64

isin

非索引列使用isin

In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [143]: s.isin([2, 4, 6])
Out[143]:
4 False
3 False
2  True
1 False
0  True
dtype: bool
In [144]: s[s.isin([2, 4, 6])]
Out[144]:
2 2
0 4
dtype: int64

索引列使用isin

In [145]: s[s.index.isin([2, 4, 6])]
Out[145]:
4 0
2 2
dtype: int64
# compare it to the following
In [146]: s[[2, 4, 6]]
Out[146]:
2 2.0
4 0.0
6 NaN
dtype: float64

结合any()/all()在多列索引时

In [151]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
 .....:     'ids2': ['a', 'n', 'c', 'n']})
 .....:
In [156]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
In [157]: row_mask = df.isin(values).all(1)
In [158]: df[row_mask]
Out[158]:
 ids ids2 vals
0 a a  1

where()

In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
In [3]: df
Out[3]:
     A   B   C   D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
In [162]: df.where(df < 0, -df)
Out[162]:
     A   B   C   D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

DataFrame.where() differs from numpy.where()的区别

In [172]: df.where(df < 0, -df) == np.where(df < 0, df, -df)

当series对象使用where()时，则返回一个序列

In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [159]: s[s > 0]
Out[159]:
3 1
2 2
1 3
0 4
dtype: int64
In [160]: s.where(s > 0)
Out[160]:
4 NaN
3 1.0
2 2.0
1 3.0
0 4.0
dtype: float64

抽样筛选

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

当在有权重筛选时，未赋值的列权重为0，如果权重和不为1，则将会将每个权重除以总和。random_state可以设置抽样的种子（seed）。axis可是设置列随机抽样。

In [105]: df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})
In [106]: df2.sample(n = 3, weights = 'weight_column')
Out[106]:
 col1 weight_column
1  8   0.4
0  9   0.5
2  7   0.1

增加行

df.loc[3,:]=4
  a b c
0 1.0 a A
1 2.0 b B
2 3.0 c C
3 4.0 4 4

插入行

pandas里并没有直接指定索引的插入行的方法，所以要自己设置

line = pd.DataFrame({df.columns[0]:"--",df.columns[1]:"--",df.columns[2]:"--"},index=[1])
df = pd.concat([df.loc[:0],line,df.loc[1:]]).reset_index(drop=True)#df.loc[:0]这里不能写成df.loc[0]，因为df.loc[0]返回的是series
  a b c
0 1.0 a A
1 -- -- --
2 2.0 b B
3 3.0 c C
4 4.0 4 4

交换行

df.loc[[1,2],:]=df.loc[[2,1],:].values
 a b c
0 1 a A
1 3 c C
2 2 b B

删除行

df.drop(0,axis=0,inplace=True)
print(df)
 a b c
1 2 b B
2 3 c C

注意

在以时间作为索引的数据框中，索引是以整形的方式来的。

In [39]: dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=pd.date_range('20130101',periods=5))
In [40]: dfl
Out[40]:
     A   B   C   D
2013-01-01 1.075770 -0.109050 1.643563 -1.469388
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05 0.895717 0.805244 -1.206412 2.565646
In [41]: dfl.loc['20130102':'20130104']
Out[41]:
     A   B   C   D
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061

列操作

选择某一列

print(df.loc[:,"a"])
0 1
1 2
2 3
Name: a, dtype: int64

选择多列

print(df.loc[:,"a":"b"])
 a b
0 1 a
1 2 b
2 3 c

增加列,如果对已有的列,则是赋值

df.loc[:,"d"]=4
 a b c d
0 1 a A 4
1 2 b B 4
2 3 c C 4

交换两列的值

df.loc[:,['b', 'a']] = df.loc[:,['a', 'b']].values
print(df)
 a b c
0 a 1 A
1 b 2 B
2 c 3 C

删除列

1）直接del DF[‘column-name']

2）采用drop方法，有下面三种等价的表达式：

DF= DF.drop(‘column_name', 1)；

DF.drop(‘column_name',axis=1, inplace=True)

DF.drop([DF.columns[[0,1,]]], axis=1,inplace=True)

df.drop("a",axis=1,inplace=True)
print(df)
 b c
0 a A
1 b B
2 c C

还有一些其他的功能：

切片df.loc[::,::]

选择随机抽样df.sample()

去重.duplicated()

查询.lookup

以上这篇Pandas 数据框增、删、改、查、去重、抽样基本操作方法就是小编分享给大家的全部内容了，希望能给大家一个参考，也希望大家多多支持我们。

Pandas 数据框增、删、改、查、去重、抽样基本操作方法

总括 pandas的索引函数主要有三种: loc 标签索引,行和列的名称 iloc 整型索引(绝对位置索引),绝对意义上的几行几列,起始索引为0 ix 是 iloc 和 loc的合体 at是loc的快捷方式 iat是iloc的快捷方式建立测试数据集: import pandas as pd df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'],'c': ["A","B","C"]}) p
简单的php数据库操作类代码(增,删,改,查)

数据库操纵基本流程为: 1.连接数据库服务器 2.选择数据库 3.执行SQL语句 4.处理结果集 5.打印操作信息其中用到的相关函数有 •resource mysql_connect ( [string server [, string username [, string password [, bool new_link [, int client_flags]]]]] ) 连接数据库服务器•resource mysql_pconnect ( [string server [, strin
Struts2实现CRUD(增删改查)功能实例代码

CRUD是Create(创建).Read(读取).Update(更新)和Delete(删除)的缩写,它是普通应用程序的缩影.如果您掌握了某框架的CRUD编写,那么意味可以使用该框架创建普通应用程序了,所以大家使用新框架开发OLTP(Online Transaction Processing)应用程序时,首先会研究一下如何编写CRUD.这类似于大家在学习新编程语言时喜欢编写"Hello World". 本文旨在讲述Struts 2上的CRUD开发,所以为了例子的简单易懂,我不会花时间在数
Python3操作MongoDB增册改查等方法详解

MongoDB是由C++语言编写的非关系型数据库,是一个基于分布式文件存储的开源数据库系统,其内容存储形式类似JSON对象,它的字段值可以包含其他文档.数组及文档数组,非常灵活. 在这一节中,我们就来看看Python 3下MongoDB的存储操作. 1. 准备工作在开始之前,请确保已经安装好了MongoDB并启动了其服务,并且安装好了Python的PyMongo库. 2. 连接MongoDB 连接MongoDB时,我们需要使用PyMongo库里面的MongoClient.一般来说,传入Mong
Python连接SQLite数据库并进行增册改查操作方法详解

SQLite简介 SQLite,是一款轻型的数据库,是遵守ACID的关系型数据库管理系统,它包含在一个相对小的C库中.它是D.RichardHipp建立的公有领域项目.它的设计目标是嵌入式的,而且目前已经在很多嵌入式产品中使用了它,它占用资源非常的低,在嵌入式设备中,可能只需要几百K的内存就够了.它能够支持Windows/Linux/Unix等等主流的操作系统,同时能够跟很多程序语言相结合,比如 Tcl.C#.PHP.Java等,还有ODBC接口,同样比起Mysql.PostgreSQL这两款开
pandas数据框,统计某列数据对应的个数方法

现在要解决的问题如下: 我们有一个数据的表第7列有许多数字,并且是用逗号分隔的,数字又有一个对应的关系: 我们要得到第7列对应关系的统计,就是每一行的第7列a有多少个,b有多少个好了,我给的解决方法如下: #!/bin/python #-*-coding:UTF-8-*- import pandas as pd import numpy as np dfidspec = pd.read_table("one.txt")#这个是对应关系的文件 dfmgs = pd.read_tabl
Pandas数据分析-pandas数据框的多层索引

目录前言创建多层索引多层索引操作索引名称的查看索引的层级索引内容的查看数据查询数据分组前言 pandas数据框针对高维数据,也有多层索引的办法去应对.多层数据一般长这个样子可以看到AB两大列,下面又有xy两小列. 行有abc三行,又分为onetwo两小行. 在分组聚合的时候也会产生多层索引,下面演示一下. 导入包和数据: import numpy as np import pandas as pd df=pd.read_excel('team.xlsx') 分组聚合: df.
读Json文件生成pandas数据框详情

目录前言 records格式 index格式 columns 类型 values格式 split 参数示例压缩与编码前言本文讲解如何加载json文件或字符串为pandas数据框.pandas把json数据分成几种典型类型,希望对你实际数据应用开发有所启示. 有时可能需要转换json文件位pandas数据框.使用pandas内置的read_json()函数很容易实现, 其语法如下: read_json(‘path’, orient=’index’) path: json文件的路径 orie
oracle监控某表变动触发器例子(监控增,删,改)

使用oracle触发器实现对某个表的增改删的监控操作,并记录到另一个表中. 代码: 复制代码代码如下: create or replace trigger test_trigger before insert or update or delete on test_table for each row declare v_id varchar2(30); v_bdlb varchar2(1); v_jgdm VARCHAR2(
使用DataTable更新数据库(增,删,改)

1.修改数据复制代码代码如下: DataRow dr = hRDataSet.Tables["emp"].Rows.Find(textBox3.Text); //DataRow dr = hRDataSet.Tables["emp"].Select("id="+textBox3.Text)[0]; dr.BeginEdit(); dr["name"] = t

Pandas 数据框增、删、改、查、去重、抽样基本操作方法

相关推荐

随机推荐