Extracting XML Document Content with Java and dom4j.jar

This article shares working code for extracting the content of XML documents with Java and dom4j.jar, for reference. The details are as follows.


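This example extracts content from an XML file mainly through a few traversal operations. It is not the most efficient approach; its main purpose is to practice using those traversals.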

Contents of the XML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf version="-//IPTC//DTD NITF 3.3//EN" change.time="19:30" change.date="June 10, 2005">

<head>

<title>An End to Nuclear Testing</title>

<meta name="publication_day_of_month" content="7"/>
<meta name="publication_month" content="7"/>
<meta name="publication_year" content="1993"/>
<meta name="publication_day_of_week" content="Wednesday"/>
<meta name="dsk" content="Editorial Desk"/>
<meta name="print_page_number" content="14"/>
<meta name="print_section" content="A"/>
<meta name="print_column" content="1"/>
<meta name="online_sections" content="Opinion"/>

<docdata>

<doc-id id-string="619929"/>

<doc.copyright year="1993" holder="The New York Times"/>

<identified-content>

<classifier type="descriptor" class="indexing_service">ATOMIC WEAPONS</classifier>
<classifier type="descriptor" class="indexing_service">NUCLEAR TESTS</classifier>
<classifier type="descriptor" class="indexing_service">TESTS AND TESTING</classifier>
<classifier type="descriptor" class="indexing_service">EDITORIALS</classifier>
<person class="indexing_service">CLINTON, BILL (PRES)</person>
<classifier type="types_of_material" class="online_producer">Editorial</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion</classifier>
<classifier type="taxonomic_classifier" class="online_producer">Top/Opinion/Opinion/Editorials</classifier>
<classifier type="general_descriptor" class="online_producer">Nuclear Tests</classifier>
<classifier type="general_descriptor" class="online_producer">Atomic Weapons</classifier>
<classifier type="general_descriptor" class="online_producer">Tests and Testing</classifier>
<classifier type="general_descriptor" class="online_producer">Armament, Defense and Military Forces</classifier>

</identified-content>
</docdata>
<pubdata name="The New York Times" unit-of-measure="word" item-length="390" ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9F0CEFDF1439F934A35754C0A965958260" date.publication="19930707T000000"/>

</head>

<body>

<body.head>

<hedline>

<hl1>An End to Nuclear Testing</hl1>

</hedline>
</body.head>

<body.content>

<block class="lead_paragraph">

<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>

</block>

<block class="full_text">

<p>For nearly half a century, test explosions in the Nevada desert were a reverberating reminder of cold war insecurity. Now the biggest worry is nuclear proliferation, not the Soviet threat. That's why President Clinton has quietly decided to extend the moratorium on tests of nuclear arms for at least 15 months.</p>
<p>To persuade nuclear have-nots to stay out of the bomb-making business, it makes more sense to halt testing and try to get others to do likewise than to conduct more demonstrations of America's deterrent power.</p>
<p>Not that nuclear wannabes will necessarily follow America's lead. Nor will an end to all testing assure an end to bomb-making; states like Pakistan have developed nuclear devices without testing them first.</p>
<p>But calling a halt to U.S. nuclear testing makes it easier for leaders in Russia and France to extend the moratoriums they are now observing and improve the atmosphere for prompt negotiation of a treaty to ban all tests.</p>
<p>That test ban in turn should shore up international support for the 1968 Nonproliferation Treaty, linchpin of efforts to stop the spread of nuclear arms, when it comes up for review in 1995. It will also bolster the backing for tighter controls on exports used in bomb-making.</p>
<p>Mr. Clinton has taken three helpful steps. He has extended the Congressionally mandated moratorium on U.S. tests that was due to expire last week. He has declared that the U.S. will not test unless another nation does so first. And he wants to negotiate a total ban on testing.</p>
<p>But the President also wants the nuclear labs to be prepared for a prompt resumption of warhead safety and reliability tests. This could cost millions of dollars and doesn't make much sense, since in Mr. Clinton's own words, "After a thorough review, my Administration has determined that the nuclear weapons in the United States' arsenal are safe and reliable."</p>
<p>Moreover, preparations for testing can take on a life of their own: 30 years after the Limited Test Ban Treaty put an end to above-ground tests, the U.S. still spends $20 million a year on Safeguard C, a program to keep test sites ready.</p>
<p>American security no longer rests on that sort of eternal nuclear vigilance. Mr. Clinton's moratorium may make America safer than all the tests and preparations for tests that the nuclear labs can dream up.</p>

</block>

</body.content>

</body>

</nitf>

Extraction code:

To process multiple files, first walk the directory tree and collect every file path into a list, then process the paths in that list one by one.

package com.njupt.ymh;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import edu.princeton.cs.algs4.In; // from algs4.jar; used only by main()

/**
 * Returns a list of file paths.
 * @author 11860
 *
 */
public class SearchFile {

 public static List<String> getAllFile(String directoryPath,boolean isAddDirectory) {
  List<String> list = new ArrayList<String>(); // collected file paths
  File baseFile = new File(directoryPath); // starting directory

  if (baseFile.isFile() || !baseFile.exists())
   return list;

  File[] files = baseFile.listFiles(); // direct children
  if (files == null) // listFiles() returns null if the directory cannot be read
   return list;

  for (File file : files) {
   if (file.isDirectory())
   {
    if(isAddDirectory) // isAddDirectory: whether subdirectory paths are also added to the list
     list.add(file.getAbsolutePath()); // absolute path

    list.addAll(getAllFile(file.getAbsolutePath(),isAddDirectory));
   }
   else
   {
    list.add(file.getAbsolutePath());
   }
  }
  return list;
 }

 public static void main(String[] args) {

  List<String> listFile = SearchFile.getAllFile("E:\\huadai", false);
  System.out.println(listFile.size());
  if (listFile.size() < 5) // the demo below reads entries 3 and 4
   return;

  File file = new File(listFile.get(3));
  In in = new In(listFile.get(4));
  while (in.hasNextLine()) {
   String readLine = in.readLine().trim(); // read the current line
   System.out.println(readLine);
  }
  System.out.println(file.length());

 }

}
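As a point of comparison, the recursive listing above could also be sketched with `java.nio.file.Files.walk` from the standard library. This is only an illustration, not the article's code: the class name `WalkFiles` and the method `listFiles` are hypothetical, the sketch covers only the `isAddDirectory == false` case, and unlike `getAllFile` it returns the paths sorted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original article.
class WalkFiles {

    // Returns the absolute paths of all regular files under root, sorted.
    static List<String> listFiles(String root) throws IOException {
        Path base = Paths.get(root);
        if (!Files.isDirectory(base)) {
            // Same outcome as getAllFile for a plain file or a missing path
            return Collections.emptyList();
        }
        // Files.walk visits the whole tree; the stream must be closed after use
        try (Stream<Path> paths = Files.walk(base)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> p.toAbsolutePath().toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a directory as the first argument; defaults to the current directory
        List<String> files = listFiles(args.length > 0 ? args[0] : ".");
        System.out.println(files.size() + " files found");
    }
}
```

The try-with-resources block matters here because `Files.walk` keeps directory handles open until the stream is closed.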
package com.njupt.ymh;

import java.io.File;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class NewsPaper {
 int doc_id; // article id
 String doc_title; // article title
 String lead_paragraph; // lead paragraph
 String full_text; // full text
 String date; // publication date
 public NewsPaper(String xml) {
  doc_id = -1; // -1 means no id has been found
  doc_title = null;
  lead_paragraph = null;
  full_text = null;
  date = null;
  searchValue(xml);
 }

 /**
  * Loads the XML file into a Document.
  * @param fileName path of the XML file
  * @return the parsed Document, or null if parsing failed
  */
 private Document load(String fileName) {
  Document document = null; // parsed document
  SAXReader saxReader = new SAXReader(); // dom4j reader

  try {
   document = saxReader.read(new File(fileName));
  } catch (DocumentException e) {
   e.printStackTrace();
  }

  return document;
 }

 /**
  * Returns the root element of the Document.
  * @param document the parsed document
  */
 private Element getRootNode(Document document) {
  return document.getRootElement();
 }

 /**
  * Extracts the required node values.
  * @param xml path of the XML file
  */
 private void searchValue(String xml) {
  Document document = load(xml);
  if (document == null) // parsing failed; keep the defaults so isUseful() returns false
   return;
  Element root = getRootNode(document); // root node

  // Publication date, cut from a fixed position in the file path.
  // This relies on the path prefix used in main() ("E:\\huadai\\"),
  // which puts the date part of the path at characters 10 to 19.
  if (xml.length() >= 20)
   date = xml.substring(10, 20);

  // Article title
  doc_title = root.valueOf("//head/title");

  // Article id
  List<Node> list_doc_id = document.selectNodes("//doc-id/@id-string");
  for (Node ele : list_doc_id) {
   doc_id = Integer.parseInt(ele.getText());
  }

  // Article content
  for (Iterator<Element> i = root.elementIterator(); i.hasNext();) {
   Element el = i.next(); // head, body

   // Only the body node is processed; compare strings with equals(), not ==
   if ("body".equals(el.getName())) {
    for (Iterator<Element> body = el.elementIterator(); body.hasNext();) {
     Element elbody = body.next();

     if ("body.content".equals(elbody.getName())) {
      for (Iterator<Element> block = elbody.elementIterator(); block.hasNext();) {
       Element block_class = block.next();

       // Constant-first equals() also copes with a block that has no class attribute
       if ("full_text".equals(block_class.attributeValue("class"))) { // full_text
        List<Node> list_text = block_class.selectNodes("p");
        for (Node text : list_text)
         if (full_text == null)
          full_text = text.getStringValue();
         else
          full_text = full_text + " " + text.getStringValue();
       }

       else { // every other block is treated as lead_paragraph
        List<Node> list_lead = block_class.selectNodes("p");
        for (Node lead : list_lead)
         if (lead_paragraph == null)
          lead_paragraph = lead.getStringValue();
         else
          lead_paragraph = lead_paragraph + " " + lead.getStringValue();
       }
      }
     }
    }
   }
  }
 }

 /**
  * Returns the article title.
  */
 public String getTitle() {
  return doc_title;
 }

 /**
  * Returns the article id.
  */
 public int getID() {
  return doc_id;
 }

 /**
  * Returns the lead paragraph.
  */
 public String getLead() {
  // Articles before 1990-10-22 (id < 394070): drop the first six characters
  if (getID() < 394070 && lead_paragraph != null && lead_paragraph.length() > 6)
   return lead_paragraph.substring(6);
  else // 1990-10-22 and later
   return lead_paragraph;
 }

 /**
  * Returns the full text of the article.
  */
 public String getfull() {
  // Articles before 1990-10-22 (id < 394070): drop the first six characters
  if (getID() < 394070 && full_text != null && full_text.length() > 6)
   return full_text.substring(6);
  else
   return full_text;
 }

 /**
  * Returns the publication date.
  */
 public String getDate() {
  return date;
 }

 /**
  * Checks whether the extracted information is usable.
  * @return true if every field is present and within its length limit
  */
 public boolean isUseful() {
  if (getID() == -1)
   return false;
  if (getDate() == null )
   return false;
  if (getTitle() == null || getTitle().length() >= 255)
   return false;
  if (getLead() == null || getLead().length() >= 65535 )
   return false;
  if (getfull() == null || getfull().length() >= 65535)
   return false;

  return !isnum();
 }

 /**
  * Detects articles that consist of numeric data, identified by a special opening.
  * @return true if the full text starts with the company-report marker
  */
 private boolean isnum() {
  if (getfull() != null && getfull().length() > 24) {
   if (getfull().substring(0, 20).contains("*3*** COMPANY REPORT") ) { // filter out numeric articles
    return true;
   }
  }
  return false;
 }

 public static void main(String[] args) {
  List<String> listFile = SearchFile.getAllFile("E:\\huadai\\1989\\10", false); // list of files
  int count = 0; // files processed
  int i = 0; // files rejected by isUseful()
  for (String string : listFile) {
   NewsPaper newsPaper = new NewsPaper(string);
   count++;
   if (!newsPaper.isUseful()) {
    i++;
    System.out.println(newsPaper.getLead());
   }
  }

  System.out.println(i + " "+ count);

 }
}
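Since the article itself notes that the nested traversal is not the most efficient approach, here is a rough sketch of how the same fields might be read with XPath expressions alone. It deliberately swaps dom4j for the JDK's built-in `javax.xml.parsers` and `javax.xml.xpath` APIs so that it runs without extra jars. The class `NitfXPath` and its methods are hypothetical names, the element paths are taken from the sample document above, and the inline XML in `main` is a shortened stand-in for a real file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class; it uses the JDK's DOM and XPath APIs instead of dom4j.
class NitfXPath {

    // Parses XML text. The feature below asks the JDK's built-in parser
    // not to fetch the external DTD named in a DOCTYPE declaration.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Text of the <title> element under <head>.
    static String title(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/nitf/head/title", doc);
    }

    // Value of the id-string attribute of <doc-id>.
    static String docId(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/nitf/head/docdata/doc-id/@id-string", doc);
    }

    // Joins the text of all <p> children of the block with the given class,
    // separated by single spaces (the same joining rule as searchValue).
    static String blockText(Document doc, String blockClass) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList paragraphs = (NodeList) xpath.evaluate(
                "/nitf/body/body.content/block[@class='" + blockClass + "']/p",
                doc, XPathConstants.NODESET);
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            if (i > 0) {
                text.append(' ');
            }
            text.append(paragraphs.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the sample document
        String xml = "<nitf><head><title>An End to Nuclear Testing</title>"
                + "<docdata><doc-id id-string=\"619929\"/></docdata></head>"
                + "<body><body.content>"
                + "<block class=\"lead_paragraph\"><p>First.</p></block>"
                + "<block class=\"full_text\"><p>First.</p><p>Second.</p></block>"
                + "</body.content></body></nitf>";
        Document doc = parse(xml);
        System.out.println(title(doc));
        System.out.println(docId(doc));
        System.out.println(blockText(doc, "full_text"));
    }
}
```

Unlike `searchValue`, this sketch selects the lead paragraph by its explicit class name rather than treating every block that is not `full_text` as the lead, so the two can differ on documents that contain other block types.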

That is all for this article. I hope it is helpful for your studies.
