Commit af725d23 authored by David Maus

Initial commit

#+TITLE: PicaReader -- Classes for reading Pica+ records
#+AUTHOR: David Maus
#+EMAIL: maus@hab.de
* About
PicaReader provides classes for reading Pica+ records encoded in PicaXML and PicaPlain.
PicaReader is copyright (c) 2012 by Herzog August Bibliothek Wolfenbüttel and released under the
terms of the GNU General Public License v3.
* Installation
PicaReader should be installed using the [[http://pear.php.net][PEAR Installer]]. This installer is the PHP community's
de facto standard for installing PHP packages.
#+BEGIN_EXAMPLE
pear channel-discover hab20.hab.de/service/pear
pear install --alldeps hab20.hab.de/service/pear/PicaReader
#+END_EXAMPLE
* Usage
All readers adhere to the same interface. You open the reader with a string of input data by calling
=Reader::open()= and then call =Reader::read()= to read the next record in the input data. If the
input contains no more records, =Reader::read()= returns =FALSE=. Otherwise it returns a record
object created with PicaRecord's =Record::factory()= function.
#+BEGIN_SRC php
$reader = new \HAB\Pica\Reader\PicaXmlReader();
$reader->open(file_get_contents('http://unapi.gbv.de?id=opac-de-23:ppn:635012286&format=picaxml'));
$record = $reader->read();
$reader->close();
#+END_SRC
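Because =Reader::read()= returns =FALSE= once the input is exhausted, a reader can be drained in a
loop. The following is only a sketch of that loop: =StubReader= is a hypothetical stand-in, not a
class of this package, and it merely mimics the =open()=, =read()=, =close()= contract described
above by treating every non-empty line as one "record".
#+BEGIN_SRC php
// Hypothetical stand-in that mimics the reader contract; NOT part of PicaReader.
class StubReader
{
    private $records = array();

    public function open($data)
    {
        // For illustration only: every non-empty line counts as one record.
        $this->records = array_values(array_filter(explode("\n", $data), 'strlen'));
    }

    public function read()
    {
        // Return the next record, or FALSE once nothing is left.
        return $this->records ? array_shift($this->records) : false;
    }

    public function close()
    {
        $this->records = array();
    }
}

$reader = new StubReader();
$reader->open("first\nsecond\n");

$seen = array();
// Compare strictly against FALSE so that only the end of input stops the loop.
while (($record = $reader->read()) !== false) {
    $seen[] = $record;
}
$reader->close();
#+END_SRC
With a real reader the loop body would receive record objects instead of strings; the strict
comparison against =FALSE= stays the same.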
To filter out records or fields, you can attach a filter to the reader via =Reader::setFilter()=. A
filter is any valid PHP callback that takes an associative array representing the record as its
argument and returns either a possibly modified array or =FALSE= if the entire record should be
skipped.
The array representation of a record is defined as follows:
#+BEGIN_EXAMPLE
RECORD := array('fields' => array(FIELD, …))
FIELD := array('tag' => TAG, 'occurrence' => OCCURRENCE, 'subfields' => array(SUBFIELD, …))
SUBFIELD := array('code' => CODE, 'value' => VALUE)
#+END_EXAMPLE
Where =TAG=, =OCCURRENCE=, =CODE=, and =VALUE= are the respective properties of a Pica+ field or
subfield.
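As an illustration of the definition above, the following sketch writes out the array
representation of a record with a single field =001A= and walks it. The helper =subfieldValues()=
is hypothetical and not part of the package, and the use of =null= for a missing occurrence is an
assumption taken from the PicaPlain reader's behavior.
#+BEGIN_SRC php
// Array representation of a record with one field and one subfield.
$record = array(
    'fields' => array(
        array(
            'tag'        => '001A',
            'occurrence' => null, // assumption: no occurrence is represented as null
            'subfields'  => array(
                array('code' => '0', 'value' => '0001:14-09-10'),
            ),
        ),
    ),
);

// Hypothetical helper: collect the values of all subfields with the given
// code in all fields with the given tag.
function subfieldValues(array $record, $tag, $code)
{
    $values = array();
    foreach ($record['fields'] as $field) {
        if ($field['tag'] !== $tag) {
            continue;
        }
        foreach ($field['subfields'] as $subfield) {
            if ($subfield['code'] === $code) {
                $values[] = $subfield['value'];
            }
        }
    }
    return $values;
}

$values = subfieldValues($record, '001A', '0');
#+END_SRC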
For example, if your source delivers malformed PicaXML records like so:
#+BEGIN_SRC xml
<?xml version="1.0" encoding="UTF-8"?>
<record xmlns="info:srw/schema/5/picaXML-v1.0">
  <datafield tag="">
  </datafield>
  <datafield tag="001A">
    <subfield code="0">0001:14-09-10</subfield>
  </datafield>
</record>
#+END_SRC
You can attach a filter function to remove these fields with an invalid tag:
#+BEGIN_SRC php
$reader = new PicaXmlReader();
$reader->setFilter(function (array $r) {
    // Keep only the fields that carry a valid Pica+ field tag.
    return array('fields' => array_filter($r['fields'],
        function (array $f) {
            return isset($f['tag']) && \HAB\Pica\Record\Field::isValidFieldTag($f['tag']);
        }));
});
// The reader must be opened with the input data via open() before reading.
$record = $reader->read();
$reader->close();
#+END_SRC
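A filter can also drop a whole record by returning =FALSE=. The sketch below shows such a callback
on its own, applied directly to two hand-written arrays rather than to a reader. The name
=$requireField=, the tag =021A=, and the sample records are chosen for illustration only.
#+BEGIN_SRC php
// Hypothetical filter: keep only records that contain a field with tag 021A.
$requireField = function (array $r) {
    foreach ($r['fields'] as $f) {
        if (isset($f['tag']) && $f['tag'] === '021A') {
            return $r;  // keep the record unchanged
        }
    }
    return false;       // skip the entire record
};

// Two hand-written sample records in the array representation.
$kept = array('fields' => array(
    array('tag' => '021A', 'occurrence' => null,
          'subfields' => array(array('code' => 'a', 'value' => 'Example title'))),
));
$skipped = array('fields' => array(
    array('tag' => '001A', 'occurrence' => null,
          'subfields' => array(array('code' => '0', 'value' => '0001:14-09-10'))),
));

$keptResult    = $requireField($kept);     // the record array itself
$skippedResult = $requireField($skipped);  // FALSE
#+END_SRC
Such a callback would be attached with =$reader->setFilter($requireField)=, in the same way as the
field filter above.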
* Development
If you want to patch or enhance this component, you will need to create a suitable development
environment. The easiest way to do that is to install phix4componentdev:
#+BEGIN_EXAMPLE
apt-get install php5-xdebug
apt-get install php5-imagick
pear channel-discover pear.phix-project.org
pear -D auto_discover=1 install -Ba phix/phix4componentdev
#+END_EXAMPLE
You can then clone the Git repository:
#+BEGIN_EXAMPLE
git clone git://gitorious.org/php-pica/picareader.git
#+END_EXAMPLE
Then, install a local copy of the package's dependencies to complete the development environment:
#+BEGIN_EXAMPLE
phing build-vendor
#+END_EXAMPLE
To make life easier for you, common tasks (such as running unit tests, generating code review
analytics, and creating the PEAR package) have been automated using [[http://phing.info][Phing]]. You'll find the
automated steps inside the build.xml file that ships with the component.
Run the command 'phing' in the component's top-level folder to see the full list of available
automated tasks.
* Acknowledgements
The [[http://phix-project.org][Phix project]] makes it easy to set up and maintain a package repository for a PEAR-installable
package and integrates important tools such as [[http://phpunit.de][PHPUnit]], [[http://phing.info][Phing]], [[http://pear.php.net][PEAR]], and [[http://pirum.sensiolabs.org/][Pirum]]. Large parts of this
package would not have been possible without studying the source of [[http://search.cpan.org/dist/PICA-Record/][Pica::Record]], an open source
Perl library for handling Pica+ records by Jakob Voß, and the practical knowledge of our library's
catalogers.
* Footnotes
.build
dist
.tmp
nbproject
review
vendor
.#*
#*
TAGS
\ No newline at end of file
syntax: glob
.build
.dist
nbproject
review
tmp
vendor
.#*
#*
TAGS
\ No newline at end of file
<project name="local" default="help">
<target name="help">
<echo message="This component has no local build targets." />
</target>
</project>
<!-- vim: set tabstop=2 shiftwidth=2 expandtab: -->
project.name=PicaReader
project.channel=hab20.hab.de/service/pear
project.majorVersion=0
project.minorVersion=1
project.patchLevel=0
project.snapshot=true
component.type=php-library
component.version=11
<?xml version="1.0" encoding="UTF-8"?>
<package packagerversion="1.9.1" version="2.0"
xmlns="http://pear.php.net/dtd/package-2.0"
xmlns:tasks="http://pear.php.net/dtd/tasks-1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pear.php.net/dtd/tasks-1.0
http://pear.php.net/dtd/tasks-1.0.xsd
http://pear.php.net/dtd/package-2.0
http://pear.php.net/dtd/package-2.0.xsd">
<name>${project.name}</name>
<channel>${project.channel}</channel>
<summary>Classes for reading Pica+ records</summary>
<description>
This package provides classes for reading Pica+ records encoded in
PicaXML or PicaPlain.
</description>
<lead>
<name>David Maus</name>
<user>dmaus</user>
<email>maus@hab.de</email>
<active>yes</active>
</lead>
<date>${build.date}</date>
<time>${build.time}</time>
<version>
<release>${project.version}</release>
<api>${project.majorVersion}.${project.minorVersion}</api>
</version>
<stability>
<release>${project.stability}</release>
<api>stable</api>
</stability>
<license>GNU General Public License v3</license>
<notes>
No notes.
</notes>
<contents>
<dir baseinstalldir="/" name="/">
${contents}
</dir>
</contents>
<dependencies>
<required>
<php>
<min>5.3.0</min>
</php>
<pearinstaller>
<min>1.9.4</min>
</pearinstaller>
<package>
<name>Autoloader</name>
<channel>pear.phix-project.org</channel>
<min>3.0.0</min>
<max>3.999.9999</max>
</package>
<package>
<name>PicaRecord</name>
<channel>hab20.hab.de/service/pear</channel>
<min>0.1.0</min>
<max>0.999.9999</max>
</package>
</required>
</dependencies>
<phprelease />
<changelog>
<release>
<version>
<release>0.1.0</release>
<api>0.1</api>
</version>
<stability>
<release>stable</release>
<api>stable</api>
</stability>
<date>2012-02-15</date>
<license>GNU General Public License v3</license>
<notes>
</notes>
</release>
</changelog>
</package>
<!-- vim: set tabstop=2 shiftwidth=2 expandtab: -->
<?xml version="1.0"?>
<phpunit bootstrap="src/tests/unit-tests/bootstrap.php">
<testsuites>
<testsuite name="Unit Tests">
<directory suffix="Test.php">src/tests/unit-tests</directory>
</testsuite>
</testsuites>
<filter>
<blacklist>
<directory suffix=".php">vendor</directory>
<directory suffix=".php">src/tests</directory>
</blacklist>
<whitelist addUncoveredFilesFromWhitelist="true">
<directory suffix=".php">src/bin</directory>
<directory suffix=".php">src/php</directory>
</whitelist>
</filter>
<logging>
<log type="coverage-html" target="review/code-coverage"/>
<log type="coverage-clover" target="review/logs/phpunit.xml"/>
<log type="json" target="review/logs/phpunit.json"/>
<log type="tap" target="review/logs/phpunit.tap"/>
<log type="junit" target="review/logs/phpunit-junit.xml"/>
<log type="testdox-html" target="review/testdox.html"/>
<log type="testdox-text" target="review/testdox.txt"/>
</logging>
</phpunit>
<!-- vim: set tabstop=4 shiftwidth=4 expandtab: -->
Your src/ folder
================
This src/ folder is where you put all of your code for release. There's
a folder for each type of file that the PEAR Installer supports. You can
find out more about these file types online at:
http://blog.stuartherbert.com/php/2011/04/04/explaining-file-roles/
* bin/
If you're creating any command-line tools, this is where you'd put
them. Files in here get installed into /usr/bin on Linux et al.
There is more information available here: http://blog.stuartherbert.com/php/2011/04/06/php-components-shipping-a-command-line-program/
You can find an example here: https://github.com/stuartherbert/phix/tree/master/src/bin
* data/
If you have any data files (any files that aren't PHP code, and which
don't belong in the www/ folder), this is the folder to put them in.
There is more information available here: http://blog.stuartherbert.com/php/2011/04/11/php-components-shipping-data-files-with-your-components/
You can find an example here: https://github.com/stuartherbert/ComponentManagerPhpLibrary/tree/master/src/data
* php/
This is where your component's PHP code belongs. Everything that goes
into this folder must be PSR-0 compliant, so that it works with the
supplied autoloader.
There is more information available here: http://blog.stuartherbert.com/php/2011/04/05/php-components-shipping-reusable-php-code/
You can find an example here: https://github.com/stuartherbert/ContractLib/tree/master/src/php
* tests/functional-tests/
Right now, this folder is just a placeholder for future functionality.
You're welcome to make use of it yourself.
* tests/integration-tests/
Right now, this folder is just a placeholder for future functionality.
You're welcome to make use of it yourself.
* tests/unit-tests/
This is where all of your PHPUnit tests go.
It needs to contain _exactly_ the same folder structure as the src/php/
folder. For each of your PHP classes in src/php/, there should be a
corresponding test file in src/tests/unit-tests/.
There is more information available here: http://blog.stuartherbert.com/php/2011/08/15/php-components-shipping-unit-tests-with-your-component/
You can find an example here: https://github.com/stuartherbert/ContractLib/tree/master/test/unit-tests
* www/
This folder is for any files that should be published in a web server's
DocRoot folder.
It's quite unusual for components to put anything in this folder, but
it is there just in case.
There is more information available here: http://blog.stuartherbert.com/php/2011/08/16/php-components-shipping-web-pages-with-your-components/
<?php
/**
* The PicaPlainReader class file.
*
* This file is part of PicaReader.
*
* PicaReader is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* PicaReader is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with PicaReader. If not, see <http://www.gnu.org/licenses/>.
*
* @package PicaReader
* @author David Maus <maus@hab.de>
* @copyright Copyright (c) 2012 by Herzog August Bibliothek Wolfenbüttel
* @license http://www.gnu.org/licenses/gpl.html GNU General Public License v3
*/
namespace HAB\Pica\Reader;
/**
* Reader for Pica+ records encoded in PicaPlain.
*
* @package PicaReader
* @author David Maus <maus@hab.de>
* @copyright Copyright (c) 2012 by Herzog August Bibliothek Wolfenbüttel
* @license http://www.gnu.org/licenses/gpl.html GNU General Public License v3
*/
class PicaPlainReader extends Reader {

    /**
     * Current input data, split into lines.
     *
     * @var array
     */
    protected $_data;

    /**
     * Open the reader with input data.
     *
     * @param string $data Input data
     * @return void
     */
    public function open ($data) {
        parent::open($data);
        // Split on CRLF, LF, or CR line endings.
        $this->_data = preg_split("/(?:\r\n|[\n\r])/", $data);
    }

    /**
     * Read the next record in input data.
     *
     * @see \HAB\Pica\Reader\Reader::next()
     *
     * @return array|false Array representation of the record or FALSE if no more records
     */
    protected function next () {
        $record = false;
        if (current($this->_data) !== false) {
            $record = array('fields' => array());
            // Read one field per line until an empty line or the end of the input.
            do {
                $line = current($this->_data);
                $record['fields'] []= $this->readField($line);
            } while (next($this->_data));
            // Step over the empty line that ended the record.
            next($this->_data);
        }
        return $record;
    }

    /**
     * Return array representation of the field encoded in a line.
     *
     * @throws \RuntimeException Invalid characters in line
     * @param string $line PicaPlain record line
     * @return array Array representation of the encoded field
     */
    protected function readField ($line) {
        $field = array('subfields' => array());
        $match = array();
        // Tag, optional "/occurrence", a space, then the encoded subfields.
        if (preg_match('#^([012][0-9]{2}[A-Z@])(/([0-9]{2}))? (\$.*)$#Du', $line, $match)) {
            $field = array('tag' => $match[1],
                           'occurrence' => $match[3] ?: null,
                           'subfields' => $this->parseSubfields($match[4]));
        } else {
            throw new \RuntimeException("Invalid characters in PicaPlain record near line {$this->getCurrentLineNumber()}");
        }
        return $field;
    }

    /**
     * Return array of array representations of the subfields encoded in the argument.
     *
     * @param string $str Encoded subfields
     * @return array Array representations of the encoded subfields
     */
    protected function parseSubfields ($str) {
        $subfields = array();
        $subfield = null;
        $pos = 0;
        $max = strlen($str);
        $state = '$';
        do {
            switch ($state) {
                case '$':
                    // Subfield delimiter: store the previous subfield, if any.
                    if (is_array($subfield)) {
                        $subfields []= $subfield;
                        $subfield = array();
                    }
                    $pos += 1;
                    $state = 'code';
                    break;
                case 'code':
                    // The character after the delimiter is the subfield code.
                    $subfield['code'] = $str[$pos];
                    $subfield['value'] = '';
                    $pos += 1;
                    $state = 'value';
                    break;
                case 'value':
                    $next = strpos($str, '$', $pos);
                    if ($next === false) {
                        // No further delimiter: the rest of the string is the value.
                        $subfield['value'] .= substr($str, $pos);
                        $pos = $max;
                    } else {
                        $subfield['value'] .= substr($str, $pos, ($next - $pos));
                        $pos = $next;
                        if (isset($str[$pos + 1]) && $str[$pos + 1] === '$') {
                            // A doubled '$$' encodes a literal '$' in the value.
                            $subfield['value'] .= '$';
                            $pos += 2;
                        } else {
                            $state = '$';
                        }
                    }
                    break;
            }
        } while ($pos < $max);
        $subfields []= $subfield;
        return $subfields;
    }

    /**
     * Close the reader.
     *
     * @return void
     */
    public function close () {
        parent::close();
        $this->_data = null;
    }

    /**
     * Return the number of the line currently parsed.
     *
     * @return integer Number of currently parsed line
     */
    protected function getCurrentLineNumber () {
        return key($this->_data);
    }
}