#1、抽取转换大集中及金三2个税明细数据到CSV文件 #2、生成大集中金三税务登记对照表到CSV文件 #3、抽取合并企业所得税年报到CSV文件 #4、抽取各种代码表到EXCEL文件 #5、生成代缴单位国地税税务登记对照表 #数据上传到阿里云,用于阿里云个税大数据分析实例 # add R library path in addition to default .libPaths( c( .libPaths(), "C:/Program Files/R/R-3.2.3/library") ) #load libraries used library(DBI) library(ROracle) library(sqldf) library(openxlsx) library(digest) #装入工作区 load("d:/temp/GeShuiDaShuJu.RData") ## 建立到ORACLE的连接 ora <- Oracle() #大集中下发库 con1 <- dbConnect(ora, username = "*********", password = "********", dbname = "ZHSJXF") #金三2下发库 con2 <- dbConnect(ora, username = "********", password = "*********", dbname = "JSXF") #path to store data path<-c("d:/temp/data") #每次抽取转换1万条个税明细记录 chunk<-10000 #将身份证号码等转换成MD5摘要,云上公网环境不应该暴露身份证号码 digestMD5<-function(inData,field,fieldMD5){ for(i in 1:length(inData[,1])){ inData[i,fieldMD5]<- digest(inData[i,field], serialize=FALSE) } return(inData) } #用于读取金三及大集中个税纳税个人的名单 readSfzm<-function(con=NULL,sql=NUll){ t1<-proc.time() rs <- dbSendQuery(con, sql) readMore<-TRUE i<-0 records<-0 # read all chunks into a csv file while(readMore){ i<-i+1 temp <- fetch(rs, n = chunk) #脱敏处理,身份证号码转换为MD5摘要值,删除身份证号码与姓名 temp$MD5SFZJHM<-"" temp1<-digestMD5(temp,"SFZJHM","MD5SFZJHM") if (length(temp[,1])CSV ora2csv<-function(con=NULL,chunk=100000,filepath=NULL,filename=NULL,sql=NULL,yy=NULL,verbose=FALSE){ # con: connection object to Oracle # chunk: rows to read for each block # filepath: file path for output temp CSV file # sql: sql statement to query # verbose: if print verbose message # colCls: column names and types need to point out clearly, # some columns would cause error if not declared clearly when imported from CSV if(is.null(filepath)) filepath<-c("d:/temp/data") if(is.null(filename)) filename<-paste("gsmx20",as.character(as.numeric(Sys.time())),".csv",sep="") outFn<-paste(filepath,"/",filename,sep="") # unable to query, return NULL if(is.null(con)||is.null(sql)) return(NULL) if(file.exists(outFn)) unlink(outFn) if(verbose){ cat("row numbers per block:",as.character(chunk),"\n") cat("sql:",sql,"\n") } # perform the query and record time elapsed t1<-proc.time() rs <- dbSendQuery(con, sql) readMore<-TRUE i<-0 records<-0 # read all chunks into a csv file while(readMore){ i<-i+1 temp <- fetch(rs, n = chunk) #读进来后要在这里做转换处理 #所得项目转换 temp1<-merge(temp,sdxm_dz,by=c("ZSPM_DM","SDXM_DM"),all.x = TRUE) temp1<-temp1[,c(3:6,1,2,18,7:17)] temp1$SDXM_DM<-NULL colnames(temp1)[6]<-c("SDXM_DM") #征收品目转换 temp1<-merge(temp1,zspm_dz,by="ZSPM_DM",all.x = TRUE) temp1$ZSPM_DM<-NULL temp1<-temp1[,c(1:4,17,5:16)] colnames(temp1)[5]<-c("ZSPM_DM") #身份证件类型转换 temp1<-merge(temp1,sfzjlx_dz,by=c("SFZJLX_DM"),all.x = TRUE) temp1$SFZJLX_DM<-NULL temp1<-temp1[,c(1,17,2:16)] colnames(temp1)[2]<-c("SFZJLX_DM") #脱敏处理,身份证号码转换为MD5摘要值,删除身份证号码与姓名 temp1$MD5SFZJHM<-"" temp1<-digestMD5(temp1,"SFZJHM","MD5SFZJHM") temp1$SFZJHM<-temp1$XM<-NULL #调整列次序 temp1<-temp1[,c(1:2,16,3:15)] #增加年度变量,以便按年分区存储,增量更新 #temp1$Y<-yy if (length(temp[,1])CSV ora2csv2<-function(con=NULL,chunk=100000,filepath=NULL,filename=NULL,sql=NULL,yy=NULL,verbose=FALSE){ # con: connection object to Oracle # chunk: rows to read for each block # filepath: file path for output temp CSV file # sql: sql statement to query # verbose: if print verbose message # colCls: column names and types need to point out clearly, # some columns would cause error if not declared clearly when imported from CSV if(is.null(filepath)) filepath<-c("d:/temp/data") if(is.null(filename)) filename<-paste("gsmx20",as.character(as.numeric(Sys.time())),".csv",sep="") outFn<-paste(filepath,"/",filename,sep="") # unable to query, return NULL if(is.null(con)||is.null(sql)) return(NULL) if(file.exists(outFn)) unlink(outFn) if(verbose){ cat("row numbers per block:",as.character(chunk),"\n") cat("sql:",sql,"\n") } # perform the query and record time elapsed t1<-proc.time() rs <- dbSendQuery(con, sql) readMore<-TRUE i<-0 records<-0 # read all chunks into a csv file while(readMore){ i<-i+1 temp <- fetch(rs, n = chunk) #读进来后要在这里做转换处理 #脱敏处理,删除身份证号码与姓名 temp$MD5SFZJHM<-"" temp1<-digestMD5(temp,"SFZJHM","MD5SFZJHM") temp1$SFZJHM<-temp1$XM<-NULL #调整列次序 temp1<-temp1[,c(1:2,16,3:15)] #增加年度变量,以便按年分区存储,增量更新 #temp1$Y<-yy if (length(temp[,1])2005 and to_char(b.SFSSQ_ZZRQ,'yyyy')<2013 and b.yzzf_bj is null --//and b.nsgrnbm=1016925062 order by a.sfzmlb_dm,a.sfzmhm,a.xm" grxx_djz<-readSfzm(con1,sql) #金三个税名单2013~2018 sql<-"select distinct to_char(b.djxh) as grxh,b.sfzjlx_dm, b.sfzjhm,b.xm from gs_cxtj.sb_kjgrsdsbgb a,gs_cxtj.sb_kjgrsdsbgb_mx b where a.JYLSH=b.JYLSH and a.sbxh=b.sbxh and a.yxbz='Y' --//and djxh=20124400001016925062 order by b.sfzjlx_dm, b.sfzjhm,b.xm" grxx_js<-readSfzm(con2,sql) #金三身份证类型代码 sql<-"select sfzjlx_dm,sfzjlxmc from hx_dm_qg.dm_gy_sfzjlx where xybz='Y' order by sfzjlx_dm" rs <- dbSendQuery(con2, sql) sfzjlx_dm<-fetch(rs, n = -1) dbClearResult(rs) #看看各种身份证件类型的人数 temp1<-sqldf("select sfzjlx_dm,count(*) as c from grxx_djz group by sfzjlx_dm order by sfzjlx_dm") temp2<-sqldf("select sfzjlx_dm,count(*) as c from grxx_js group by sfzjlx_dm order by sfzjlx_dm") #看看大集中不同证件类型间证件号码有没有重复 temp3<-sqldf("select sfzjhm,count(*) as c from grxx_djz group by sfzjhm order by c desc") sqldf("select count(*) from temp3 where c>1") #看看金三不同证件类型间证件号码有没有重复 temp4<-sqldf("select sfzjhm,count(*) as c from grxx_js group by sfzjhm order by c desc") sqldf("select count(*) from temp4 where c>1") #大集中征收品目代码 sql<-"select zspm_dm,mc from db_xtwh.t_dm_gy_zspm where zsxm_dm='06' and zspm_dm<>'XXXX' order by zspm_dm" rs <- dbSendQuery(con1, sql) zspm_djz<-fetch(rs, n = -1) dbClearResult(rs) #大集中所得项目代码 sql<-"select zspm_dm,sdxmzm_dm,mc from db_xtwh.t_dm_gs_sdxmzm order by zsxm_dm,zspm_dm,sdxmzm_dm" rs <- dbSendQuery(con1, sql) sdxm_djz<-fetch(rs, n = -1) dbClearResult(rs) #金三征收品目代码 sql<-"select zspm_dm,zspmmc,sl_1 as sl from gs_cxtj.dm_gs_zspm where xybz='Y' order by zspm_dm,zspmmc,sl_1" rs <- dbSendQuery(con2, sql) zspm_js<-fetch(rs, n = -1) dbClearResult(rs) #金三所得项目代码 sql<-"select GRSDSSDXM_DM,GRSDSSDXMMC from hx_dm_qg.dm_gy_grsdssdxm where xybz='Y' order by GRSDSSDXM_DM" rs <- dbSendQuery(con2, sql) sdxm_js<-fetch(rs, n = -1) dbClearResult(rs) #大集中2006年12月个税明细数据,用于测试转换程序 sql<-"select to_char(a.nsgrnbm) as grxh, a.sfzmlb_dm as sfzjlx_dm,a.sfzmhm as sfzjhm,a.xm,b.zspm_dm,b.sdxmzm_dm as sdxm_dm, to_char(b.sdq_qsrq,'yyyy-mm-dd') as sdqq,to_char(b.sdq_zzrq,'yyyy-mm-dd') as sdqz, to_char(b.sfssq_qsrq,'yyyy-mm-dd') as ssqq,to_char(b.sfssq_zzrq,'yyyy-mm-dd') as ssqz, to_char(b.sb_rq,'yyyy-mm-dd') as sbrq,(b.sreze_je+b.jwsreze_je) as sre, b.js_je as ynssde,b.sl, b.bqybtse_je as ynse,b.sj_je as sjse,to_char(b.nsrnbm) as dwxh from db_gsxt.t_gs_grjbxx a, db_gsxt.t_gs_grsbjkmx b where a.nsgrnbm=b.nsgrnbm and b.yzzf_bj is null and to_char(b.sfssq_zzrq,'yyyy-mm')='2006-12'" rs <- dbSendQuery(con1, sql) gsmx0612<-fetch(rs, n = -1) dbClearResult(rs) #取前面1万条测试MD5转换函数的性能 temp5<-head(gsmx0612,10000) t1<-proc.time() temp5$MD5SFZJHM<-"" temp6<-digestMD5(temp5,"SFZJHM","MD5SFZJHM") t2<-proc.time() cat(t2-t1) nchar(temp6$MD5SFZJHM[1]) #读入征收品目对照表 zspm_dz<-read.xlsx("d:/temp/data/个人所得税征收品目对照表.xlsx") zspm_dz$MC<-NULL #取前面1万条测试征收品目转换到金三 temp5<-head(gsmx0612,10000) temp6<-merge(temp5,zspm_dz,by="ZSPM_DM",all.x = TRUE) temp6$ZSPM_DM<-NULL temp6<-temp6[,c(1:4,17,5:16)] colnames(temp6)[5]<-c("ZSPM_DM") #读入所得项目对照表 sdxm_dz<-read.xlsx("d:/temp/data/个人所得税所得项目对照表.xlsx") sdxm_dz$MC<-NULL #取前面1万条测试所得项目转换到金三 temp5<-head(gsmx0612,10000) temp6<-merge(temp5,sdxm_dz,by=c("ZSPM_DM","SDXM_DM"),all.x = TRUE) temp6<-temp6[,c(3:6,1,2,18,7:17)] temp6$SDXM_DM<-NULL colnames(temp6)[6]<-c("SDXM_DM") #读入身份证件类型对照表 sfzjlx_dz<-read.xlsx("d:/temp/data/个人所得税身份证件类型对照表.xlsx") sfzjlx_dz$MC<-NULL #取前面1万条测试身份证件类型转换到金三 temp5<-head(gsmx0612,10000) temp6<-merge(temp5,sfzjlx_dz,by=c("SFZJLX_DM"),all.x = TRUE) temp6$SFZJLX_DM<-NULL temp6<-temp6[,c(1,17,2:16)] colnames(temp6)[2]<-c("SFZJLX_DM") #从大集中抽取2006~2012年的个税明细,2013-14年金三上线时已迁移到金三 y<-c("06","07","08","09","10","11","12") for(i in 2:length(y)){ #i<-2 sql<-paste("select to_char(a.nsgrnbm) as grxh, a.sfzmlb_dm as sfzjlx_dm,a.sfzmhm as sfzjhm,a.xm,b.zspm_dm,b.sdxmzm_dm as sdxm_dm, to_char(b.sdq_qsrq,'yyyy-mm-dd') as sdqq,to_char(b.sdq_zzrq,'yyyy-mm-dd') as sdqz, to_char(b.sfssq_qsrq,'yyyy-mm-dd') as ssqq,to_char(b.sfssq_zzrq,'yyyy-mm-dd') as ssqz, to_char(b.sb_rq,'yyyy-mm-dd') as sbrq,(b.sreze_je+b.jwsreze_je) as sre, b.js_je as ynssde,b.sl, b.bqybtse_je as ynse,b.sj_je as sjse,to_char(b.nsrnbm) as dwxh from db_gsxt.t_gs_grjbxx a, db_gsxt.t_gs_grsbjkmx b where a.nsgrnbm=b.nsgrnbm and b.yzzf_bj is null and to_char(b.sfssq_zzrq,'yyyy')='20",y[i],"'",sep="") ora2csv(con1,10000,path,paste("gsmx20",y[i],".csv",sep=""),sql,y[i],TRUE) } #金三2013年12月个税明细数据,用于测试转换程序 sql<-"select to_char(b.djxh) as grxh,b.sfzjlx_dm,b.sfzjhm,b.xm,b.zspm_dm,b.grsdssdxm_dm as sdxm_dm,to_char(b.sdqjq,'yyyy-mm-dd') as sdqq, to_char(b.sdqjz,'yyyy-mm-dd') as sdqz,to_char(a.sksssqq,'yyyy-mm-dd') as ssqq,to_char(a.sksssqz,'yyyy-mm-dd') as ssqz, to_char(b.lrrq,'yyyy-mm-dd') as sbrq,b.sre,b.ynssde,b.sl_1 as sl,b.ynse,b.ybtse as sjse,to_char(a.kjywrdjxh) as dwxh from gs_cxtj.sb_kjgrsdsbgb a,gs_cxtj.sb_kjgrsdsbgb_mx b where a.JYLSH=b.JYLSH and a.sbxh=b.sbxh and a.yxbz='Y' and to_char(a.sksssqz,'yyyy-mm')='2013-12' " rs <- dbSendQuery(con2, sql) gsmx1312<-fetch(rs, n = -1) dbClearResult(rs) #取前面1万条测试MD5转换函数的性能 temp5<-head(gsmx1312,10000) t1<-proc.time() temp5$MD5SFZJHM<-"" temp6<-digestMD5(temp5,"SFZJHM","MD5SFZJHM") t2<-proc.time() cat(t2-t1) nchar(temp6$MD5SFZJHM[1]) #金三抽取2013-2018年的数据 y<-c("13","14","15","16","17","18") for(i in 1:length(y)){ #i<-1 sql<-paste("select to_char(b.djxh) as grxh,b.sfzjlx_dm,b.sfzjhm,b.xm,b.zspm_dm,b.grsdssdxm_dm as sdxm_dm,to_char(b.sdqjq,'yyyy-mm-dd') as sdqq, to_char(b.sdqjz,'yyyy-mm-dd') as sdqz,to_char(a.sksssqq,'yyyy-mm-dd') as ssqq,to_char(a.sksssqz,'yyyy-mm-dd') as ssqz, to_char(b.lrrq,'yyyy-mm-dd') as sbrq,b.sre,b.ynssde,b.sl_1 as sl,b.ynse,b.ybtse as sjse,to_char(a.kjywrdjxh) as dwxh from gs_cxtj.sb_kjgrsdsbgb a,gs_cxtj.sb_kjgrsdsbgb_mx b where a.JYLSH=b.JYLSH and a.sbxh=b.sbxh and a.yxbz='Y' and to_char(a.sksssqz,'yyyy')='20",y[i],"'",sep="") ora2csv2(con2,10000,path,paste("gsmx20",y[i],".csv",sep=""),sql,y[i],TRUE) } #读入阿里云提取的个税代缴单位序号,约20万个 #并据此建立金三与大集中代缴单位税务登记对照表 dwxh<-read.xlsx("d:/temp/个税代扣单位序号.xlsx") #大集中税务登记 sql<-"select to_char(nsrnbm) as nsrnbm,nsr_mc,swdjlb_dm,swdjzh,zzjg_dm,shxydm,zclx_dm,hy_dm,dj_zt,to_char(djxh) as djxh from db_sbfdb.t_dj_jgnsr order by nsrnbm" rs <- dbSendQuery(con1, sql) swdjdjz<-fetch(rs, n = -1) dbClearResult(rs) #金三税务登记 sql<-"select to_char(djxh) as djxh,nsrmc,ssdabh,nsrsbh,KZZTDJLX_DM,zzjg_dm,shxydm,DJZCLX_DM,HY_DM,NSRZT_DM from hx_dj.dj_nsrxx order by djxh" rs <- dbSendQuery(con2, sql) swdjjs<-fetch(rs, n = -1) dbClearResult(rs) #长度20位的是金三的登记序号,其他是大集中纳税人内部码 temp4<-sqldf("select length(dwxh) as dwxh, count(*) as c from dwxh group by length(dwxh) order by length(dwxh)") #尝试按名称匹配,看看匹配程度,48万条,效果比较好 swdj1<-sqldf("select a.nsrnbm, b.djxh, b.nsrmc from swdjdjz a,swdjjs b where a.nsr_mc=b.nsrmc") #看看匹配的登记序号中有多少在金三还存在,20.2万个中有15.5万个(有重复),其他的或者是已经没有在金三申报了,或者是没有匹配 swdj2<-sqldf("select * from swdj1 where djxh in(select dwxh from dwxh)") #在金三中存在并且匹配的有14.3万条 djxh1<-sqldf("select * from dwxh where dwxh in(select djxh from swdj1 where length(djxh)=20)") #金三中存在但不匹配的有1539条 djxh2<-sqldf("select * from dwxh where dwxh not in(select * from djxh1) and length(dwxh)=20") #税务登记金三有大集中没有的,纳税人内部码置为登记序号,349户, #其他1539-349=1190户金三税务登记不在下发库,要了解一下原因,可能是外来经营临时缴个税 swdj3<-sqldf("select b.djxh as NSRNBM,b.DJXH,b.SSDABH,b.NSRMC,b.KZZTDJLX_DM,b.DJZCLX_DM, b.HY_DM,b.NSRZT_DM from swdjjs b where djxh in (select dwxh from djxh2)") #去重后也是349户 temp7<-sqldf("select nsrnbm,count(*) as c from swdj3 group by nsrnbm order by c desc") temp8<-sqldf("select djxh,count(*) as c from swdj3 group by djxh order by c desc") #税务登记大集中有金三没有的,17885,这些登记序号应该已注销了 djxh3<-sqldf("select dwxh from dwxh where dwxh not in(select distinct nsrnbm from swdj2) and length(dwxh)<20") swdj4<-sqldf("select nsrnbm,NSRNBM AS DJXH,NSRNBM AS SSDABH,nsr_mc as NSRMC,SWDJLB_DM AS KZZTDJLX_DM,ZCLX_DM AS DJZCLX_DM, HY_DM,DJ_ZT AS NSRZT_DM from swdjdjz where nsrnbm in (select * from djxh3)") #去重后有17693户 temp7<-sqldf("select nsrnbm,count(*) as c from swdj4 group by nsrnbm order by c desc") temp8<-sqldf("select djxh,count(*) as c from swdj4 group by djxh order by c desc") #这些登记序号是从大集中延续到金三的,155507 swdj5<-sqldf("select a.nsrnbm,b.DJXH,b.SSDABH,b.NSRMC,b.KZZTDJLX_DM,b.DJZCLX_DM, b.HY_DM,b.NSRZT_DM from swdj2 a, swdjjs b where length(a.djxh)=20 and a.djxh=b.djxh") #去重后有154757 temp7<-sqldf("select nsrnbm,count(*) as c from swdj5 group by nsrnbm order by c desc") temp8<-sqldf("select djxh,count(*) as c from swdj5 group by djxh order by c desc") #三部分合并后就是大集中金三税务登记对照表 swdj6<-rbind(swdj4,swdj5,swdj3) swdj6<-swdj6[!duplicated(swdj6),] swdj6<-swdj6[order(swdj6$NSRNBM),] #上传到阿里云的对照表,屏蔽掉纳税人名称 swdj7<-swdj6 swdj7$NSRMC<-NULL #对照表去重 temp3<-sqldf("select nsrnbm,djxh from swdj7") temp3<-temp3[!duplicated(temp3),] #看看多对多的情况 temp4<-sqldf("select nsrnbm,count(*) as c from temp3 group by nsrnbm order by c desc") #一个纳税人内部码对应多个登记序号的情况不多,526个 sqldf("select count(*) as c from temp4 where c>1") temp5<-sqldf("select djxh,count(*) as c from temp3 group by djxh order by c desc") #4970,一个登记序号对应多个纳税人内部码的情况较多,采取纳税人内部码转换为登记序号升级的方式比较合适 sqldf("select count(*) as c from temp5 where c>1") #这一户一个登记序号对应603个纳税人内部码 temp6<-sqldf("select * from swdj6 where djxh='10124404000003617846'") #输出上传到阿里云的对照表,屏蔽掉纳税人名称 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "大集中金三个税代缴单位对照表") writeDataTable(wb=wb,sheet = 1,x=swdj7, colNames=TRUE) saveWorkbook(wb,"d:/temp/大集中金三个税代缴单位对照表.xlsx",overwrite = TRUE) #输出自己用的对照表,保留纳税人名称 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "大集中金三个税代缴单位对照表") writeDataTable(wb=wb,sheet = 1,x=swdj6, colNames=TRUE) saveWorkbook(wb,"d:/temp/大集中金三个税代缴单位对照表2.xlsx",overwrite = TRUE) #课征主体类型代码表 sql<-"select kzztdjlx_dm,kzztdjlxmc from hx_dm_qg.dm_dj_kzztdjlx order by kzztdjlx_dm" rs <- dbSendQuery(con2, sql) kzztdjlx<-fetch(rs, n = -1) dbClearResult(rs) #输出课征主体类型代码表 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "课征主体登记类型") writeDataTable(wb=wb,sheet = 1,x=kzztdjlx, colNames=TRUE) saveWorkbook(wb,"d:/temp/data/课征主体登记类型.xlsx",overwrite = TRUE) #登记注册类型代码表 sql<-"select djzclx_dm,djzclxmc from hx_dm_qg.dm_dj_djzclx order by djzclx_dm" rs <- dbSendQuery(con2, sql) djzclx<-fetch(rs, n = -1) dbClearResult(rs) #输出登记注册类型代码表 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "登记注册类型") writeDataTable(wb=wb,sheet = 1,x=djzclx, colNames=TRUE) saveWorkbook(wb,"d:/temp/data/登记注册类型.xlsx",overwrite = TRUE) #行业代码 sql<-"select hy_dm,hymc from hx_dm_qg.dm_gy_hy order by hy_dm" rs <- dbSendQuery(con2, sql) hydm<-fetch(rs, n = -1) dbClearResult(rs) #输出行业代码 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "行业代码") writeDataTable(wb=wb,sheet = 1,x=hydm, colNames=TRUE) saveWorkbook(wb,"d:/temp/data/行业代码.xlsx",overwrite = TRUE) #纳税人状态代码 sql<-"select nsrzt_dm,nsrztmc from hx_dm_qg.dm_gy_nsrzt" rs <- dbSendQuery(con2, sql) nsrzt<-fetch(rs, n = -1) dbClearResult(rs) #输出纳税人状态代码 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "纳税人状态") writeDataTable(wb=wb,sheet = 1,x=nsrzt, colNames=TRUE) saveWorkbook(wb,"d:/temp/data/纳税人状态.xlsx",overwrite = TRUE) #抽取合并企业所得税年报,用于个税风险微观分析 #合并企业所得税年报时,匹配国地税的税务登记,因为个税是地税征收,而企业所得税,国地税都有征收 #对于国税征收企业所得税的,要先建立与地税之间税务登记的匹配关系 #同一个单位在国地税税收档案号相同,但都有一个税收档案号对应多个登记序号的情况 #一般来说,申报企业所得税的就是其总部,主要的税务登记。 #地税企业所得税年报 sql<-"select to_char(a.djxh) as djxhds,c.ssdabh,to_char(a.pzxh) as yzpzxh, to_char(a.skssqq,'yy-mm-dd') as skssqq,to_char(a.skssqz,'yy-mm-dd') as skssqz,b.*,d.GZXJZCZZJE,d.GZXJZCSSJE,d.GZXJZCNSTZJE from hx_sb.sb_sbb a left join hx_dj.dj_nsrxx c on a.djxh=c.djxh, hx_sb.SB_SDS_JMCZ_14ND_QYSDSNDNSSBZB b left join HX_SB.SB_SDS_JMCZ_14ND_ZGXCNSTZMXB d on b.sbuuid=d.SBUUID where a.sbuuid=b.sbuuid and a.zfbz_1='N' order by a.djxh,a.lrrq " rs <- dbSendQuery(con2, sql) nbds<-fetch(rs, n = -1) dbClearResult(rs) #每年取最后一份 nbds2<-sqldf("select a.* from nbds a where LRRQ=(select max(LRRQ) from nbds where skssqq=a.skssqq AND skssqz=A.skssqz and djxhds=a.djxhds)") #国税纳税数据,用于匹配地税税务登记 sql<-"select to_char(a.djxh) as djxh ,b.ssdabh,to_char(a.SKSSQQ,'yyyy') as y, sum(a.ybtse) as ybtse from gdgs_zs.zs_yjsf a left join gdgs_dj.dj_nsrxx b on a.djxh=b.djxh where a.skcllx_dm='1' --//开票 --//and yjskzt_dm not in('12','13') --//非 本年新欠 往年陈欠 取消该限制 and a.skzl_dm='10' --//正税 and a.czlx_dm in ('10','23','24','28','72','79','89','90','91','99') --// 10 申报 23 开具通用完税证 24 开具通用缴款书 28 税票录入 72 预缴税费开票 79 录入其他应缴税费(上线前) --// 89 委托代征完税证录入 90 稽查查补不加收滞纳金 91 稽查查补不加收罚款 99 迁移数据 and a.tzlx_dm in ('1','4') and a.sjgsdq like '14404%' --//and zsxm_dm in('10101','10102','10103','10109','30203','30216') group by to_char(a.djxh) ,b.ssdabh,to_char(a.SKSSQQ,'yyyy') order by to_char(a.djxh) ,b.ssdabh,to_char(a.SKSSQQ,'yyyy')" rs <- dbSendQuery(con2, sql) gsjs<-fetch(rs, n = -1) dbClearResult(rs) #300747,统计每个税收档案号每年有多少个登记序号有税收业务发生 temp<-sqldf("select ssdabh,y,count(*) as c from gsjs group by ssdabh,y order by c desc,ssdabh") #4401,1.5%,一号多登记的情况不多 sqldf("select count(*) as c from temp where c>1") #每年取税收最多的国税登记序号匹配 gsjs2<-sqldf("select a.* from gsjs a where YBTSE=(select max(YBTSE) from gsjs where SSDABH=a.SSDABH AND Y=A.Y)") #每个税收档案号只取一个登记序号 gsjs3<-sqldf("select distinct ssdabh,djxh as DJXHGS from gsjs2 order by ssdabh,djxh") #与地税按税收档案号匹配,191544 nbds3<-merge(gsjs3,nbds2,by="SSDABH",all=TRUE) #50535,国税没有业务发生 sqldf("select count(*) from nbds3 where djxhgs is null") #99290,地税有报年报的 nbds3<-sqldf("select * from nbds3 where djxhds is not null") #调整列顺序,以便与国税年报合并 nbds3<-nbds3[,c(3,2,1,seq(4:60))] nbds3$SSDABH.1<-nbds3$DJXHGS.1<-nbds3$DJXHDS.1<-NULL #标记为地税管户 nbds3$GHLX<-"地税" #28378 户有在地税报年报的,swdj6,代缴单位大集中金三税务登记对照表 temp<-sqldf("select distinct djxh from swdj6 where djxh in(select distinct djxhds from nbds3)") #国税企业所得税年报 sql<-"select to_char(a.djxh) as djxhgs,c.ssdabh,to_char(a.pzxh) as yzpzxh, to_char(a.skssqq,'yy-mm-dd') as skssqq,to_char(a.skssqz,'yy-mm-dd') as skssqz,b.*,d.GZXJZCZZJE,d.GZXJZCSSJE,d.GZXJZCNSTZJE from gdgs_sb.sb_sbb a left join gdgs_dj.dj_nsrxx c on a.djxh=c.djxh, gdgs_sb.SB_SDS_JMCZ_14ND_QYSDSNDNSSBZB b left join gdgs_sb.SB_SDS_JMCZ_14ND_ZGXCNSTZMXB d on b.sbuuid=d.SBUUID where a.sbuuid=b.sbuuid and a.zfbz_1='N' order by a.djxh,a.lrrq " rs <- dbSendQuery(con2, sql) nbgs<-fetch(rs, n = -1) dbClearResult(rs) #每年取最后一份 nbgs2<-sqldf("select a.* from nbgs a where LRRQ=(select max(LRRQ) from nbgs where skssqq=a.skssqq AND skssqz=A.skssqz and djxhgs=a.djxhgs)") #金三个税工资薪金明细,用于匹配国税税务登记 sql<-"select d.ssdabh,c.* from (select to_char(a.KJYWRDJXH) as DJXH,to_char(a.SKSSSQQ,'yyyy') as y,sum(b.ybtse) as GSJE, SUM(b.SRE) AS SRE, SUM(b.YNSSDE) AS YNSSDE from gs_cxtj.sb_kjgrsdsbgb a,gs_cxtj.sb_kjgrsdsbgb_mx b where a.JYLSH=b.JYLSH and a.sbxh=b.sbxh and a.yxbz='Y' and b.GRSDSSDXM_DM='0101' --//and b.djxh=20124400001016925062 group by a.KJYWRDJXH,to_char(a.SKSSSQQ,'yyyy') ) c left join hx_dj.dj_nsrxx d on c.djxh=d.djxh order by c.DJXH,c.y" rs <- dbSendQuery(con2, sql) jsgs<-fetch(rs, n = -1) dbClearResult(rs) #550142,统计每个税收档案号每年有多少个登记序号有税收业务发生 temp<-sqldf("select ssdabh,y,count(*) as c from jsgs group by ssdabh,y order by c desc,ssdabh") #93,1.7%%,一号多登记的情况极少,可以忽略 sqldf("select count(*) as c from temp where c>1") #每年取税收最多的登记序号匹配 jsgs2<-sqldf("select a.* from jsgs a where GSJE=(select max(GSJE) from jsgs where SSDABH=a.SSDABH AND Y=A.Y)") #每个税收档案号只取一个登记序号 jsgs3<-sqldf("select distinct ssdabh,djxh from jsgs2 order by ssdabh,djxh") #与地税按税收档案号匹配,171053 nbgs3<-sqldf("select b.djxh as DJXHDS,a.* from nbgs2 a,jsgs3 b where a.ssdabh=b.ssdabh order by b.djxh ") #标记为地税管户 nbgs3$GHLX<-"国税" #70788 户有在国税报年报的 temp<-sqldf("select distinct djxh from swdj6 where djxh in(select distinct djxhds from nbgs3)") #合并国地税所得税年报 nb<-rbind(nbds3,nbgs3) #查看合并后各年年报申报的情况 sqldf("select substr(skssqq,1,2) as yy, count(*) as c from nb group by substr(skssqq,1,2) order by substr(skssqq,1,2)") #70788 户有报年报的 temp<-sqldf("select distinct djxh from swdj6 where djxh in(select distinct djxhds from nb)") #查看户数及在国地税的分布 sqldf("select count(distinct djxhds) from nb") sqldf("select count(distinct djxhgs) from nb") sqldf("select count(distinct ssdabh) from nb") #虽然记录数不多27.8万条,变量比较多,58个,输出成Excel导致java堆栈溢出 # wb<-createWorkbook() # addWorksheet(wb=wb,sheetName = "企业所得税年报") # writeDataTable(wb=wb,sheet = 1,x=nb, colNames=TRUE) # saveWorkbook(wb,"d:/temp/data/企业所得税年报.xlsx",overwrite = TRUE) #改输出成CSV write.csv(nb,"d:/temp/企业所得税年报.csv",row.names = FALSE) #看看一个税收档案号有多个登记序号报年报的情况,266546 temp1<-sqldf("select ssdabh,substr(skssqq,1,2) as y,count(*) as c from nb group by ssdabh,substr(skssqq,1,2) order by c desc,ssdabh") #3742,1.4%,不多。 sqldf("select count(*) as c from temp1 where c>1") #看看共有多少个税收档案号(多少户),95756 sqldf("select count(distinct ssdabh) as c from temp1") #看看没有地税登记序号的有多少,0,没有。 sqldf("select count(*) as c from nb where djxhds is null") #输出金三1金三2税务登记对照表 #因为是个税分析,以金三2地税的登记资料为主,补上国地税的纳税人识别号,用于个税微观风险分析 #国税税务登记信息 sql<-"select to_char(djxh) as djxh,nsrmc,ssdabh,nsrsbh,KZZTDJLX_DM,zzjg_dm,shxydm,DJZCLX_DM,HY_DM,NSRZT_DM from gdgs_dj.dj_nsrxx order by djxh" rs <- dbSendQuery(con2, sql) swdjjs2<-fetch(rs, n = -1) dbClearResult(rs) #代缴单位国地税税务登记对照表 swdjdzb<-sqldf("select c.ssdabh as ssdabh,c.djxhds as djxhds,c.djxhgs as djxhgs,a.nsrmc as nsrmc,a.nsrsbh as nsrsbhds, b.nsrsbh as nsrsbhgs,a.kzztdjlx_dm as kzztdjlx_dm,a.zzjg_dm as zzjg_dm,a.shxydm as shxydm, a.djzclx_dm as djzclx_dm,a.hy_dm as hy_dm,a.nsrzt_dm as nsrzt_dm from swdjjs a, swdjjs2 b,(select ssdabh,djxhds,djxhgs from nb) c where a.djxh=c.djxhds and b.djxh=c.djxhgs order by c.ssdabh") swdjdzb<-swdjdzb[!duplicated(swdjdzb),] #输出代缴单位国地税税务登记对照表 wb<-createWorkbook() addWorksheet(wb=wb,sheetName = "金三1金三2税务登记对照表") writeDataTable(wb=wb,sheet = 1,x=swdjdzb, colNames=TRUE) saveWorkbook(wb,"d:/temp/金三1金三2税务登记对照表.xlsx",overwrite = TRUE) #保存工作区 save.image("d:/temp/GeShuiDaShuJu.RData",compress=TRUE)